Fiber Route Diversity and Backhaul Restoration Case Study
Telecommunications case study of a fiber backhaul outage caused by shared physical route risk, covering route diversity, failover capacity, latency, restoration evidence, and operational lessons.
This case study follows a realistic telecommunications outage: a regional service site loses both nominally redundant fiber backhaul circuits after one civil-work incident. The topology diagram showed two links. The field route records showed a different truth: both circuits shared the same bridge crossing and entered the site through the same duct bank.
The case is not about fiber being unreliable. Fiber links can be extremely reliable. The case is about a common engineering mistake: treating logical redundancy as physical diversity. A resilient service needs route evidence, failure-domain mapping, failover capacity, traffic prioritization, monitoring, and restoration records that operations can use under pressure.
Case Summary
| Item | Engineering relevance |
|---|---|
| Service | Backhaul for a remote operations, telemetry, and public communication site. |
| Normal architecture | Two leased fiber circuits from different providers plus a lower-capacity microwave backup. |
| Trigger event | Construction damage at a bridge duct crossing. |
| Hidden weakness | Both fiber circuits used the same physical crossing and same site-entry duct. |
| Main consequence | The site entered degraded service on microwave backup with limited capacity and higher timing variation. |
| Useful outcome | Route-diversity audit, failover policy correction, restoration evidence, and monitoring thresholds. |
The central engineering question is:
Did the service have true physical diversity, and could it remain useful when the assumed diverse fibers failed together?
The answer was no for the original design, but the incident created the evidence needed to redesign the service boundary.
Initial Architecture
The site supports:
- operational voice and messaging;
- telemetry from remote equipment;
- maintenance access and monitoring;
- ordinary user data traffic;
- emergency coordination during severe weather.
The network design lists three backhaul paths:
| Path | Nominal capacity | Expected role |
|---|---|---|
| Fiber A | $1.0\ \text{Gbit/s}$ | primary service path |
| Fiber B | $1.0\ \text{Gbit/s}$ | redundant service path |
| Microwave backup | $180\ \text{Mbit/s}$ | degraded service path |
The operations dashboard marks the site as protected because two fiber carriers are present. The design review, however, had not required proof of physical separation between ducts, poles, bridges, building entry, patch panels, and local power.
Operating Requirement
The site has three traffic classes:
| Traffic class | Required throughput | Latency objective | Loss tolerance | Priority |
|---|---|---|---|---|
| Critical voice and control | $25\ \text{Mbit/s}$ | less than $40\ \text{ms}$ one way | very low | highest |
| Telemetry and monitoring | $60\ \text{Mbit/s}$ | less than $80\ \text{ms}$ one way | moderate | high |
| General data and maintenance | $350\ \text{Mbit/s}$ peak | best effort | tolerant | low |
Normal peak demand can exceed $400\ \text{Mbit/s}$, but the microwave backup can carry only $180\ \text{Mbit/s}$ under good RF conditions. Therefore degraded operation must use traffic prioritization. The backup path cannot preserve all normal services.
Event Timeline
The outage sequence is reconstructed from alarms, provider tickets, site logs, and field reports.
| Time | Event |
|---|---|
| 08:12 | Fiber A reports loss of light. Traffic moves to Fiber B. |
| 08:14 | Fiber B reports loss of light. Site failover starts microwave backup. |
| 08:17 | Monitoring shows packet loss and high queue delay on low-priority traffic. |
| 08:25 | Provider notices both fiber circuits cross the same bridge duct segment. |
| 09:10 | Field crew confirms duct damage from construction work. |
| 09:35 | Operations applies degraded-service traffic policy. |
| 13:20 | Temporary fiber splice restores Fiber A. |
| 16:40 | Fiber B restored through same duct, but diversity remains unresolved. |
| Next week | Design team opens route-diversity correction project. |
The first operational mistake was assuming Fiber B represented an independent failure domain. It did not. The second was letting general data traffic compete with critical traffic during the first degraded interval.
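
The timeline supports a few interval calculations that belong in a closeout record. The following is a minimal sketch using the timestamps from the table. The helper and variable names are illustrative, and it assumes all listed events fall on the same day and that the microwave-only interval starts at the 08:14 failover.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps copied from the event timeline.
fiber_a_down = "08:12"
fiber_b_down = "08:14"
policy_applied = "09:35"
fiber_a_restored = "13:20"
fiber_b_restored = "16:40"

# Interval with both fibers down, when the site ran on microwave only.
microwave_only_min = minutes_between(fiber_b_down, fiber_a_restored)
# Part of that interval spent without the degraded-service traffic policy.
unmanaged_min = minutes_between(fiber_b_down, policy_applied)
# Per-circuit outage durations.
fiber_a_outage_min = minutes_between(fiber_a_down, fiber_a_restored)
fiber_b_outage_min = minutes_between(fiber_b_down, fiber_b_restored)

print(f"Microwave-only interval: {microwave_only_min} min")
print(f"Without traffic policy:  {unmanaged_min} min")
print(f"Fiber A outage:          {fiber_a_outage_min} min")
print(f"Fiber B outage:          {fiber_b_outage_min} min")
```

On these timestamps, the site spent 81 of its 306 microwave-only minutes without the traffic policy, which is the interval the second operational mistake refers to.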
Capacity Check During Degraded Operation
During the initial failover interval, measured offered load is:
| Traffic class | Offered load |
|---|---|
| Critical voice and control | $22\ \text{Mbit/s}$ |
| Telemetry and monitoring | $54\ \text{Mbit/s}$ |
| General data and maintenance | $290\ \text{Mbit/s}$ |
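Total offered load:
$$L_{\text{offered}} = 22 + 54 + 290 = 366\ \text{Mbit/s}$$
Microwave backup capacity:
$$C_{\text{backup}} = 180\ \text{Mbit/s}$$
Overload ratio:
$$\frac{L_{\text{offered}}}{C_{\text{backup}}} = \frac{366}{180} \approx 2.03$$
The backup path is offered about $203\%$ of its capacity. Congestion is expected unless low-priority traffic is shaped or dropped.
If the site admits only the critical and telemetry classes:
$$L_{\text{protected}} = 22 + 54 = 76\ \text{Mbit/s}$$
Utilization on the backup path becomes:
$$\frac{L_{\text{protected}}}{C_{\text{backup}}} = \frac{76}{180} \approx 0.42$$
Protected traffic uses about $42\%$ of the backup capacity, leaving margin for protocol overhead, burstiness, retransmission, management traffic, and RF modulation changes.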
Engineering Interpretation
The backup link was not undersized for the essential service. It was undersized for the unfiltered service. The engineering failure was not only a physical route problem; it was also a traffic policy problem. Degraded operation must be designed, not discovered during the event.
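
One way to design degraded operation in advance is to decide how the backup capacity is shared before the event. The sketch below admits traffic classes in strict priority order using the offered loads measured during failover. The function name and the reserve of 36 Mbit/s (20% of the link) are assumptions chosen for illustration, not values from the incident record, and a real deployment would enforce this in router queueing policy, not in a script.

```python
def allocate_by_priority(capacity_mbps, reserve_mbps, offered):
    """Admit traffic classes in strict priority order.

    offered: list of (class_name, offered_mbps), highest priority first.
    reserve_mbps: capacity held back for overhead and bursts (assumed value).
    Returns a dict mapping class_name to admitted Mbit/s.
    """
    remaining = max(capacity_mbps - reserve_mbps, 0)
    admitted = {}
    for name, load in offered:
        granted = min(load, remaining)
        admitted[name] = granted
        remaining -= granted
    return admitted


# Offered loads measured during the initial failover interval (Mbit/s).
offered_load = [
    ("critical voice and control", 22),
    ("telemetry and monitoring", 54),
    ("general data and maintenance", 290),
]

# 180 Mbit/s microwave capacity; the 36 Mbit/s reserve is an assumption.
admitted = allocate_by_priority(180, 36, offered_load)
for name, mbps in admitted.items():
    print(f"{name}: {mbps} Mbit/s admitted")
```

With these inputs the two protected classes are admitted in full and general traffic is limited to whatever capacity remains after the reserve.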
Latency and Jitter Evidence
Before traffic policy correction, the microwave backup shows:
| Metric | Measured value |
|---|---|
| 95th percentile one-way latency | $126\ \text{ms}$ |
| Peak-to-peak jitter | $72\ \text{ms}$ |
| Packet loss | $3.8\%$ |
After low-priority traffic is rate-limited and bulk maintenance flows are blocked:
| Metric | Measured value |
|---|---|
| 95th percentile one-way latency | $31\ \text{ms}$ |
| Peak-to-peak jitter | $14\ \text{ms}$ |
| Packet loss | $0.05\%$ |
The protected service then meets the critical voice and control latency target:
$$31\ \text{ms} < 40\ \text{ms}$$
It also stays below the telemetry target:
$$31\ \text{ms} < 80\ \text{ms}$$
Engineering Interpretation
The microwave path had enough technical capacity for protected traffic, but it needed an explicit degraded-service policy. Without that policy, queueing delay dominated performance. This is why service assurance must include traffic classes, not only physical links.
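
The comparison against the latency objectives can be automated so that it runs during failover tests as well as after incidents. The following is a minimal sketch. It assumes the measured 95th-percentile one-way latency of the backup path applies to every protected class, which is a simplification, and the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySample:
    label: str
    p95_one_way_ms: float


# One-way latency objectives from the operating requirement table (ms).
OBJECTIVES_MS = {
    "critical voice and control": 40.0,
    "telemetry and monitoring": 80.0,
}


def meets_objectives(sample: LatencySample) -> dict:
    """Map each class to True when the p95 latency is below its objective."""
    return {
        traffic_class: sample.p95_one_way_ms < limit
        for traffic_class, limit in OBJECTIVES_MS.items()
    }


# Measured values from the two latency tables.
before = LatencySample("before traffic policy", 126.0)
after = LatencySample("after traffic policy", 31.0)

for sample in (before, after):
    for traffic_class, ok in meets_objectives(sample).items():
        status = "meets" if ok else "misses"
        print(f"{sample.label}: {traffic_class} {status} objective")
```

Both classes miss their objective before the policy is applied and meet it afterwards.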
Route Diversity Audit
After restoration, the team audits the physical dependency chain. The audit checks whether supposedly redundant services share any single failure point.
| Dependency | Fiber A | Fiber B | Independent? |
|---|---|---|---|
| Long-haul provider | Provider 1 | Provider 2 | yes |
| Regional metro ring | North ring | South ring | yes |
| River crossing | Bridge duct 4 | Bridge duct 4 | no |
| Site-entry duct | East duct bank | East duct bank | no |
| Building patch room | Room A | Room A | no |
| DC power plant | Power plant 1 | Power plant 1 | no |
The providers and metro rings are different, but the river crossing, site-entry duct, patch room, and DC power plant are not. The network topology was diverse at the carrier layer but not at the physical route layer.
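The audit defines a shared-risk group containing the circuits that share these physical elements:
$$\text{SRLG}_1 = \{\text{Fiber A},\ \text{Fiber B}\}$$
Any service that relies on two circuits inside $\text{SRLG}_1$ should not be counted as physically diverse.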
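
The audit comparison can be expressed as a simple check over dependency records. The sketch below copies the rows of the audit table into dictionaries and reports every dependency where both circuits use the same element. The key names and the function name are illustrative assumptions; a production inventory would hold route segments at much finer granularity.

```python
# Dependency records copied from the audit table.
ROUTES = {
    "Fiber A": {
        "long_haul_provider": "Provider 1",
        "regional_metro_ring": "North ring",
        "river_crossing": "Bridge duct 4",
        "site_entry_duct": "East duct bank",
        "building_patch_room": "Room A",
        "dc_power_plant": "Power plant 1",
    },
    "Fiber B": {
        "long_haul_provider": "Provider 2",
        "regional_metro_ring": "South ring",
        "river_crossing": "Bridge duct 4",
        "site_entry_duct": "East duct bank",
        "building_patch_room": "Room A",
        "dc_power_plant": "Power plant 1",
    },
}


def shared_dependencies(route_a: dict, route_b: dict) -> dict:
    """Return the dependencies where both routes use the same element."""
    return {
        dependency: element
        for dependency, element in route_a.items()
        if route_b.get(dependency) == element
    }


shared = shared_dependencies(ROUTES["Fiber A"], ROUTES["Fiber B"])
physically_diverse = not shared

for dependency, element in shared.items():
    print(f"shared {dependency}: {element}")
print("physically diverse:", physically_diverse)
```

Any non-empty result means the two circuits belong to the same shared-risk group for at least one failure domain.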
Restoration Decision
The first restoration option is to repair both fibers through the damaged bridge duct. That restores capacity quickly but does not correct diversity. The second option is to keep one circuit on the repaired bridge route and procure a second path through a separate river crossing and west site entry. That takes longer but removes the correlated failure.
The team separates immediate restoration from permanent remediation:
- Restore Fiber A through the temporary splice for capacity.
- Keep microwave backup active and monitored until both fiber services are stable.
- Restore Fiber B only as temporary service, not as accepted diversity.
- Open a route-diversity remediation package for a physically separate path.
- Update service records so operations do not count Fiber B as independent until the route changes.
Engineering Interpretation
The repair that returns traffic is not necessarily the repair that restores resilience. Service restoration and resilience restoration are different states. The closeout record should say which state has been achieved.
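
The distinction between the two states can be made explicit in the closeout record itself. The following is a hedged sketch of one way to derive the state. The enum values, the function name, and the rule that resilience requires at least two circuits in service with no shared physical elements are assumptions made for illustration.

```python
from enum import Enum


class RestorationState(Enum):
    FIBER_DOWN = "fiber service down"
    SERVICE_RESTORED = "service restored"
    RESILIENCE_RESTORED = "resilience restored"


def closeout_state(circuits_in_service: int, shared_elements: int) -> RestorationState:
    """Derive the closeout state from circuit count and shared-risk count."""
    if circuits_in_service == 0:
        return RestorationState.FIBER_DOWN
    if circuits_in_service >= 2 and shared_elements == 0:
        return RestorationState.RESILIENCE_RESTORED
    # Traffic is flowing, but a single event could still remove all fiber paths.
    return RestorationState.SERVICE_RESTORED


# After both repairs through the same duct, the four shared elements
# found by the audit are still present.
after_repair = closeout_state(circuits_in_service=2, shared_elements=4)
# Hypothetical future state in which no shared elements remain.
after_remediation = closeout_state(circuits_in_service=2, shared_elements=0)

print(after_repair.value)
print(after_remediation.value)
```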
Failure Modes Exposed
| Failure mode | Evidence | Corrective control |
|---|---|---|
| false physical diversity | both providers used bridge duct 4 | require route evidence and SRLG mapping |
| unprotected site entry | both circuits entered east duct bank | add west entry or alternate aerial route |
| backup congestion | offered load exceeded backup capacity | degraded-service traffic policy |
| weak handover records | operations trusted topology diagram | attach route map and dependency record |
| alarm ambiguity | the two fiber failures were raised as separate tickets | correlate alarms by site and route group |
| restoration ambiguity | service restored before diversity restored | split service-restored and resilience-restored states |
The important lesson is not that every site needs fully separate everything. The lesson is that the claimed availability must match the actual failure domains.
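
The alarm-correlation control in the table can be sketched as grouping alarms by site and shared-risk group. The example below is illustrative only: the site name, the circuit-to-group mapping, and the 10-minute window are assumptions, and the window test simply compares the first and last alarm in each group.

```python
from collections import defaultdict

# Hypothetical mapping from circuit to (site, shared-risk group).
CIRCUIT_GROUPS = {
    "Fiber A": ("remote-site", "SRLG_1"),
    "Fiber B": ("remote-site", "SRLG_1"),
}


def correlate(alarms, window_min=10):
    """Flag groups where two or more circuits alarm within the window.

    alarms: list of (circuit_name, minute_of_day) tuples.
    Returns one incident dict per (site, risk group) that qualifies.
    """
    grouped = defaultdict(list)
    for circuit, minute_of_day in alarms:
        grouped[CIRCUIT_GROUPS[circuit]].append((minute_of_day, circuit))

    incidents = []
    for (site, risk_group), events in grouped.items():
        events.sort()
        circuits = sorted({circuit for _, circuit in events})
        span = events[-1][0] - events[0][0]
        if len(circuits) >= 2 and span <= window_min:
            incidents.append(
                {"site": site, "risk_group": risk_group, "circuits": circuits}
            )
    return incidents


# Loss-of-light alarms from the timeline: 08:12 and 08:14.
alarms = [("Fiber A", 8 * 60 + 12), ("Fiber B", 8 * 60 + 14)]
incidents = correlate(alarms)
print(incidents)
```

Presented this way, the two loss-of-light alarms appear as one site-level incident on a shared route group instead of two unrelated provider tickets.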
Validation After Remediation
The permanent remediation adds a second physical path through a west entrance and a different river crossing. Validation evidence includes:
- provider route confirmation with map references;
- site walkdown photos for east and west duct entries;
- optical power baseline for both paths;
- optical time-domain reflectometry traces for final routes;
- failover test from Fiber A to Fiber B;
- failover test from fiber service to microwave backup;
- traffic policy test under degraded operation;
- monitoring alarms tied to route group and service impact.
Example post-remediation acceptance values:
| Test | Result | Decision |
|---|---|---|
| Fiber A optical margin | $7.8\ \text{dB}$ | pass |
| Fiber B optical margin | $8.4\ \text{dB}$ | pass |
| Fiber A to Fiber B failover | $620\ \text{ms}$ service interruption | pass for this service |
| Fiber to microwave failover | $2.8\ \text{s}$ degraded transition | pass with traffic policy |
| Protected traffic on microwave | $33\ \text{ms}$ 95th percentile latency | pass |
| Bulk traffic during microwave mode | rate-limited to $70\ \text{Mbit/s}$ | pass |
The evidence supports a new operating statement: the site has two physically separated fiber paths for normal resilience and a microwave degraded-service path for temporary continuity.
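
As a rough cross-check of the microwave-mode values, the bulk rate limit can be added to the protected load and compared with the nominal backup capacity. This is a sketch that assumes the full $180\ \text{Mbit/s}$ is available; a modulation downshift under poor RF conditions would reduce the margin. Using the protected load measured during the incident:

$$22 + 54 + 70 = 146\ \text{Mbit/s} \le 180\ \text{Mbit/s}, \qquad \frac{180 - 146}{180} \approx 19\%\ \text{headroom}$$

Using the required throughput of the two protected classes instead:

$$25 + 60 + 70 = 155\ \text{Mbit/s} \le 180\ \text{Mbit/s}, \qquad \frac{180 - 155}{180} \approx 14\%\ \text{headroom}$$

In both cases the rate-limited bulk traffic fits alongside the protected classes, which is consistent with the pass decision in the table.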
Lessons for Engineering Practice
Route diversity must be proven at the layer where the failure occurs. Carrier diversity, VLAN diversity, router diversity, and logical topology diversity do not prove physical diversity. A backhoe, flood, bridge fire, building-entry failure, or patch-room error follows geography and process, not the network diagram.
Useful review questions are:
- Do redundant circuits share ducts, bridges, poles, trays, risers, patch rooms, power, or maintenance procedures?
- Does the backup path have enough capacity for the protected service, not the full normal load?
- Are traffic classes enforced before queueing destroys latency and jitter?
- Are alarms correlated by site impact and shared-risk group?
- Does the restoration report distinguish restored traffic from restored resilience?
- Can future engineers find the evidence without reconstructing the incident?
Transferable Takeaways
The case transfers to data centers, industrial plants, emergency networks, cellular backhaul, ports, campuses, mines, hospitals, and transportation systems. The same pattern appears whenever a service is declared redundant from a logical diagram while physical dependencies remain hidden.
A strong telecommunications design does not merely add backup links. It makes failure domains visible, sizes degraded service intentionally, tests failover under load, and leaves records that operations can trust during the next incident.