Fiber Route Diversity and Backhaul Restoration Case Study

Telecommunications case study of a fiber backhaul outage caused by shared physical route risk, covering route diversity, failover capacity, latency, restoration evidence, and operational lessons.

This case study follows a realistic telecommunications outage: a regional service site loses both nominally redundant fiber backhaul circuits after one civil-work incident. The topology diagram showed two links. The field route records showed a different truth: both circuits shared the same bridge crossing and entered the site through the same duct bank.

The case is not about fiber being unreliable. Fiber links can be extremely reliable. The case is about a common engineering mistake: treating logical redundancy as physical diversity. A resilient service needs route evidence, failure-domain mapping, failover capacity, traffic prioritization, monitoring, and restoration records that operations can use under pressure.

Case Summary

| Item | Engineering relevance |
|---|---|
| Service | Backhaul for a remote operations, telemetry, and public communication site. |
| Normal architecture | Two leased fiber circuits from different providers plus a lower-capacity microwave backup. |
| Trigger event | Construction damage at a bridge duct crossing. |
| Hidden weakness | Both fiber circuits used the same physical crossing and same site-entry duct. |
| Main consequence | The site entered degraded service on microwave backup with limited capacity and higher timing variation. |
| Useful outcome | Route-diversity audit, failover policy correction, restoration evidence, and monitoring thresholds. |

The central engineering question is:

Did the service have true physical diversity, and could it remain useful when the assumed diverse fibers failed together?

The answer was no for the original design, but the incident created the evidence needed to redesign the service boundary.

Initial Architecture

The site supports:

  • operational voice and messaging;
  • telemetry from remote equipment;
  • maintenance access and monitoring;
  • ordinary user data traffic;
  • emergency coordination during severe weather.

The network design lists three backhaul paths:

| Path | Nominal capacity | Expected role |
|---|---|---|
| Fiber A | 1.0 Gbit/s | primary service path |
| Fiber B | 1.0 Gbit/s | redundant service path |
| Microwave backup | 180 Mbit/s | degraded service path |

The operations dashboard marks the site as protected because two fiber carriers are present. The design review, however, had not required proof of physical separation between ducts, poles, bridges, building entry, patch panels, and local power.

Operating Requirement

The site has three traffic classes:

| Traffic class | Required throughput | Latency objective | Loss tolerance | Priority |
|---|---|---|---|---|
| Critical voice and control | 25 Mbit/s | less than 40 ms one way | very low | highest |
| Telemetry and monitoring | 60 Mbit/s | less than 80 ms one way | moderate | high |
| General data and maintenance | 350 Mbit/s peak | best effort | tolerant | low |

Normal peak demand can exceed 400 Mbit/s (the three class requirements sum to 435 Mbit/s), but the microwave backup can carry only 180 Mbit/s under good RF conditions. The backup path therefore cannot preserve all normal services, and degraded operation must rely on traffic prioritization.
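The prioritization requirement can be illustrated with a minimal sketch. It assumes a strict-priority admission rule, which the case study does not specify, and the names `TrafficClass` and `allocate_by_priority` are hypothetical. The throughput figures come from the traffic class table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficClass:
    name: str
    demand_mbps: float
    priority: int  # assumed convention: lower number is served first


def allocate_by_priority(classes, capacity_mbps):
    """Grant each class its demand in priority order until capacity runs out."""
    remaining = capacity_mbps
    grants = {}
    for tc in sorted(classes, key=lambda c: c.priority):
        granted = min(tc.demand_mbps, remaining)
        grants[tc.name] = granted
        remaining -= granted
    return grants


# Required throughput per class, taken from the operating requirement table.
CLASSES = [
    TrafficClass("critical voice and control", 25.0, 0),
    TrafficClass("telemetry and monitoring", 60.0, 1),
    TrafficClass("general data and maintenance", 350.0, 2),
]

# Microwave backup capacity under good RF conditions.
grants = allocate_by_priority(CLASSES, capacity_mbps=180.0)

for name, mbps in grants.items():
    print(f"{name}: {mbps:.0f} Mbit/s")
```

Under this assumed rule the two protected classes receive their full requirement and general data is held to the remaining 95 Mbit/s. A real policy would normally reserve some of that remainder as margin rather than granting all of it.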

Event Timeline

The outage sequence is reconstructed from alarms, provider tickets, site logs, and field reports.

| Time | Event |
|---|---|
| 08:12 | Fiber A reports loss of light. Traffic moves to Fiber B. |
| 08:14 | Fiber B reports loss of light. Site failover starts microwave backup. |
| 08:17 | Monitoring shows packet loss and high queue delay on low-priority traffic. |
| 08:25 | Provider notices both fiber circuits cross the same bridge duct segment. |
| 09:10 | Field crew confirms duct damage from construction work. |
| 09:35 | Operations applies degraded-service traffic policy. |
| 13:20 | Temporary fiber splice restores Fiber A. |
| 16:40 | Fiber B restored through same duct, but diversity remains unresolved. |
| Next week | Design team opens route-diversity correction project. |

The first operational mistake was assuming Fiber B represented an independent failure domain. It did not. The second was letting general data traffic compete with critical traffic during the first degraded interval.
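The intervals implied by the timeline can be computed directly from the logged times. The sketch below assumes all timestamps fall on the same day and that the site stayed on the microwave backup until the Fiber A splice came up. The helper `minutes_between` and the variable names are illustrative.

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps from the event timeline.
fiber_a_lost = "08:12"
fiber_b_lost = "08:14"
policy_applied = "09:35"
fiber_a_restored = "13:20"
fiber_b_restored = "16:40"

# Gap between the two supposedly independent failures.
gap_between_failures = minutes_between(fiber_a_lost, fiber_b_lost)

# Time on the backup with no degraded-service policy in force.
unmanaged_degraded = minutes_between(fiber_b_lost, policy_applied)

# Total time carried on microwave only (assumption: until the Fiber A splice).
on_microwave = minutes_between(fiber_b_lost, fiber_a_restored)

# Time on a single restored fiber before Fiber B returned.
single_fiber = minutes_between(fiber_a_restored, fiber_b_restored)

print(gap_between_failures, unmanaged_degraded, on_microwave, single_fiber)
```

The two fibers failed 2 minutes apart, and the site ran for 81 minutes on the backup before the traffic policy was applied, out of 306 minutes on microwave in total.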

Capacity Check During Degraded Operation

During the initial failover interval, measured offered load is:

| Traffic class | Offered load |
|---|---|
| Critical voice and control | 22 Mbit/s |
| Telemetry and monitoring | 54 Mbit/s |
| General data and maintenance | 290 Mbit/s |

Total offered load:

R_{offered}=22+54+290=366\ \text{Mbit/s}

Microwave backup capacity:

R_{backup}=180\ \text{Mbit/s}

Overload ratio:

\displaystyle \rho=\frac{R_{offered}}{R_{backup}}=\frac{366}{180}=2.03

The backup path is offered about 203% of its capacity. Congestion is expected unless low-priority traffic is shaped or dropped.

If the site admits only the critical and telemetry classes:

R_{protected}=22+54=76\ \text{Mbit/s}

Utilization on the backup path becomes:

\displaystyle u=\frac{76}{180}=0.422

Protected traffic uses about 42% of the backup capacity, leaving margin for protocol overhead, burstiness, retransmission, management traffic, and RF modulation changes.
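The capacity check above can be reproduced in a few lines, which also gives the amount of traffic that must be shaped or dropped. This is a sketch using the measured offered loads from the table. The variable names are illustrative and the "protected" set is taken to be the critical and telemetry classes, as in the text.

```python
# Measured offered load during the initial failover interval, in Mbit/s.
offered_mbps = {
    "critical voice and control": 22.0,
    "telemetry and monitoring": 54.0,
    "general data and maintenance": 290.0,
}
BACKUP_CAPACITY_MBPS = 180.0
PROTECTED = ("critical voice and control", "telemetry and monitoring")

total_offered = sum(offered_mbps.values())
overload_ratio = total_offered / BACKUP_CAPACITY_MBPS

# Traffic that cannot fit on the backup and must be shaped or dropped.
excess_mbps = total_offered - BACKUP_CAPACITY_MBPS

protected_load = sum(offered_mbps[name] for name in PROTECTED)
protected_utilization = protected_load / BACKUP_CAPACITY_MBPS

# Capacity left on the backup once protected traffic is admitted.
headroom_mbps = BACKUP_CAPACITY_MBPS - protected_load

print(f"offered {total_offered:.0f} Mbit/s, overload ratio {overload_ratio:.2f}")
print(f"excess to shape or drop: {excess_mbps:.0f} Mbit/s")
print(f"protected utilization {protected_utilization:.3f}, headroom {headroom_mbps:.0f} Mbit/s")
```

With these inputs, 186 Mbit/s of offered load exceeds the backup capacity, while admitting only the protected classes leaves 104 Mbit/s of headroom.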

Engineering Interpretation

The backup link was not undersized for the essential service. It was undersized for the unfiltered service. The engineering failure was not only a physical route problem; it was also a traffic policy problem. Degraded operation must be designed, not discovered during the event.

Latency and Jitter Evidence

Before traffic policy correction, the microwave backup shows:

| Metric | Measured value |
|---|---|
| 95th percentile one-way latency | 126 ms |
| Peak-to-peak jitter | 72 ms |
| Packet loss | 3.8% |

After low-priority traffic is rate-limited and bulk maintenance flows are blocked:

| Metric | Measured value |
|---|---|
| 95th percentile one-way latency | 31 ms |
| Peak-to-peak jitter | 14 ms |
| Packet loss | 0.05% |

The protected service then meets the critical voice and control latency target:

31\ \text{ms}<40\ \text{ms}

It also stays below the telemetry target:

31\ \text{ms}<80\ \text{ms}
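The same comparison can be run for both measurement sets. The sketch below assumes the path-level 95th percentile latency applies to every class on the backup link, since the source reports one figure per measurement rather than one per class. The dictionary names are illustrative.

```python
# One-way latency objectives from the operating requirement, in ms.
OBJECTIVES_MS = {
    "critical voice and control": 40.0,
    "telemetry and monitoring": 80.0,
}

# Measured 95th percentile one-way latency on the microwave backup, in ms.
MEASURED_P95_MS = {
    "before policy": 126.0,
    "after policy": 31.0,
}

# True where the measured latency is strictly below the class objective.
results = {
    (phase, traffic_class): p95 < objective
    for phase, p95 in MEASURED_P95_MS.items()
    for traffic_class, objective in OBJECTIVES_MS.items()
}

for (phase, traffic_class), ok in results.items():
    print(f"{phase:13s} {traffic_class:27s} {'meets' if ok else 'misses'} objective")
```

Before the policy change the measured latency misses both objectives. After it, both are met.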

Engineering Interpretation

The microwave path had enough technical capacity for protected traffic, but it needed an explicit degraded-service policy. Without that policy, queueing delay dominated performance. This is why service assurance must include traffic classes, not only physical links.

Route Diversity Audit

After restoration, the team audits the physical dependency chain. The audit checks whether supposedly redundant services share any single failure point.

| Dependency | Fiber A | Fiber B | Independent? |
|---|---|---|---|
| Long-haul provider | Provider 1 | Provider 2 | yes |
| Regional metro ring | North ring | South ring | yes |
| River crossing | Bridge duct 4 | Bridge duct 4 | no |
| Site-entry duct | East duct bank | East duct bank | no |
| Building patch room | Room A | Room A | no |
| DC power plant | Power plant 1 | Power plant 1 | no |

The providers and metro rings are different, but the river crossing, site-entry duct, patch room, and DC power plant are shared. The network was diverse at the carrier layer but not at the physical route layer.

The audit defines a shared-risk group:

SRLG_1=\{\text{bridge duct 4},\ \text{east site-entry duct},\ \text{patch room A}\}

Any service that relies on two circuits inside SRLG_1 should not be counted as physically diverse.
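The audit comparison lends itself to a mechanical check. The sketch below models each circuit's route record as a plain dictionary, which is an assumption about how such records might be stored, and the function name `shared_dependencies` is hypothetical. The values are copied from the audit table.

```python
# Route records per circuit, keyed by dependency type (values from the audit table).
ROUTE_RECORDS = {
    "Fiber A": {
        "long-haul provider": "Provider 1",
        "regional metro ring": "North ring",
        "river crossing": "Bridge duct 4",
        "site-entry duct": "East duct bank",
        "building patch room": "Room A",
        "DC power plant": "Power plant 1",
    },
    "Fiber B": {
        "long-haul provider": "Provider 2",
        "regional metro ring": "South ring",
        "river crossing": "Bridge duct 4",
        "site-entry duct": "East duct bank",
        "building patch room": "Room A",
        "DC power plant": "Power plant 1",
    },
}


def shared_dependencies(route_a, route_b):
    """Return the dependencies where both circuits rely on the same element."""
    return {
        dep: element
        for dep, element in route_a.items()
        if route_b.get(dep) == element
    }


shared = shared_dependencies(ROUTE_RECORDS["Fiber A"], ROUTE_RECORDS["Fiber B"])

# A pair counts as physically diverse only if nothing is shared.
physically_diverse = not shared

for dep, element in shared.items():
    print(f"shared {dep}: {element}")
print("physically diverse:", physically_diverse)
```

Run against the audit table, this flags four shared elements. That is one more than the three listed in SRLG_1, because the table also marks the DC power plant as not independent.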

Restoration Decision

The first restoration option is to repair both fibers through the damaged bridge duct. That restores capacity quickly but does not correct diversity. The second option is to keep one circuit on the repaired bridge route and procure a second path through a separate river crossing and west site entry. That takes longer but removes the correlated failure.

The team separates immediate restoration from permanent remediation:

  1. Restore Fiber A through the temporary splice for capacity.
  2. Keep microwave backup active and monitored until both fiber services are stable.
  3. Restore Fiber B only as temporary service, not as accepted diversity.
  4. Open a route-diversity remediation package for a physically separate path.
  5. Update service records so operations do not count Fiber B as independent until the route changes.

Engineering Interpretation

The repair that returns traffic is not necessarily the repair that restores resilience. Service restoration and resilience restoration are different states. The closeout record should say which state has been achieved.
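One way to keep the two states apart in a closeout record is to make them distinct values. The sketch below is illustrative only: the state names, the `closeout_state` function, and its decision rule are assumptions, not the team's actual record format.

```python
from enum import Enum


class RestorationState(Enum):
    OUTAGE = "no backhaul path available"
    DEGRADED = "running on microwave backup only"
    SERVICE_RESTORED = "fiber capacity back, diversity not accepted"
    RESILIENCE_RESTORED = "two physically independent fiber paths in service"


def closeout_state(fibers_up, fibers_independent, backup_up):
    """Classify the site from fiber count, audited independence, and backup status."""
    if fibers_up == 0:
        return RestorationState.DEGRADED if backup_up else RestorationState.OUTAGE
    if fibers_up >= 2 and fibers_independent:
        return RestorationState.RESILIENCE_RESTORED
    # Fiber traffic is flowing, but a single shared failure could still take it down.
    return RestorationState.SERVICE_RESTORED


# After both fibers are repaired through the same bridge duct.
after_repair = closeout_state(fibers_up=2, fibers_independent=False, backup_up=True)

# After a physically separate second path is accepted.
after_remediation = closeout_state(fibers_up=2, fibers_independent=True, backup_up=True)

print(after_repair.name, "->", after_remediation.name)
```

The point of the design is that `fibers_independent` must come from route evidence, not from the circuit count.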

Failure Modes Exposed

| Failure mode | Evidence | Corrective control |
|---|---|---|
| false physical diversity | both providers used bridge duct 4 | require route evidence and SRLG mapping |
| unprotected site entry | both circuits entered east duct bank | add west entry or alternate aerial route |
| backup congestion | offered load exceeded backup capacity | degraded-service traffic policy |
| weak handover records | operations trusted topology diagram | attach route map and dependency record |
| alarm ambiguity | both fibers failed as separate tickets | correlate alarms by site and route group |
| restoration ambiguity | service restored before diversity restored | split service-restored and resilience-restored states |

The important lesson is not that every site needs fully separate everything. The lesson is that the claimed availability must match the actual failure domains.
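The alarm-correlation control in the table can be sketched as grouping alarms by shared-risk group within a short time window. The five-minute window, the `ROUTE_GROUP` mapping, and the `correlate` function are assumptions for illustration. The two alarms are the loss-of-light events from the timeline.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical mapping from circuit to shared-risk group, as an audit would record it.
ROUTE_GROUP = {"Fiber A": "SRLG_1", "Fiber B": "SRLG_1"}

# (time, circuit, alarm) tuples from the event timeline.
ALARMS = [
    ("08:12", "Fiber A", "loss of light"),
    ("08:14", "Fiber B", "loss of light"),
]


def correlate(alarms, window=timedelta(minutes=5)):
    """Merge alarms in the same route group that arrive within `window` of each other."""
    by_group = defaultdict(list)
    for stamp, circuit, kind in alarms:
        # Circuits with no recorded group are treated as their own group.
        group = ROUTE_GROUP.get(circuit, circuit)
        by_group[group].append((datetime.strptime(stamp, "%H:%M"), circuit, kind))

    incidents = []
    for group, items in by_group.items():
        items.sort()
        current = [items[0]]
        for item in items[1:]:
            if item[0] - current[-1][0] <= window:
                current.append(item)
            else:
                incidents.append((group, current))
                current = [item]
        incidents.append((group, current))
    return incidents


incidents = correlate(ALARMS)
for group, items in incidents:
    circuits = ", ".join(circuit for _, circuit, _ in items)
    print(f"{group}: {len(items)} alarms ({circuits})")
```

With the route group recorded, the two loss-of-light alarms collapse into a single incident instead of two separate tickets.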

Validation After Remediation

The permanent remediation adds a second physical path through a west entrance and a different river crossing. Validation evidence includes:

  • provider route confirmation with map references;
  • site walkdown photos for east and west duct entries;
  • optical power baseline for both paths;
  • optical time-domain reflectometry traces for final routes;
  • failover test from Fiber A to Fiber B;
  • failover test from fiber service to microwave backup;
  • traffic policy test under degraded operation;
  • monitoring alarms tied to route group and service impact.

Example post-remediation acceptance values:

| Test | Result | Decision |
|---|---|---|
| Fiber A optical margin | 7.8 dB | pass |
| Fiber B optical margin | 8.4 dB | pass |
| Fiber A to Fiber B failover | 620 ms service interruption | pass for this service |
| Fiber to microwave failover | 2.8 s degraded transition | pass with traffic policy |
| Protected traffic on microwave | 33 ms 95th percentile latency | pass |
| Bulk traffic during microwave mode | rate-limited to 70 Mbit/s | pass |

The evidence supports a new operating statement: the site has two physically separated fiber paths for normal resilience and a microwave degraded-service path for temporary continuity.
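A quick consistency check shows why the 70 Mbit/s bulk limit fits on the backup. The sketch assumes protected traffic stays near the levels measured during the incident (22 Mbit/s and 54 Mbit/s), which the acceptance table does not state. The variable names are illustrative.

```python
BACKUP_CAPACITY_MBPS = 180.0

# Assumption: protected load matches the incident measurements.
protected_load_mbps = 22.0 + 54.0

# Bulk rate limit from the acceptance table.
bulk_limit_mbps = 70.0

worst_case_load = protected_load_mbps + bulk_limit_mbps
utilization = worst_case_load / BACKUP_CAPACITY_MBPS
headroom_mbps = BACKUP_CAPACITY_MBPS - worst_case_load

# Latency acceptance against the critical voice and control objective.
measured_p95_ms = 33.0
critical_objective_ms = 40.0
latency_ok = measured_p95_ms < critical_objective_ms

print(f"load {worst_case_load:.0f} Mbit/s, utilization {utilization:.3f}, "
      f"headroom {headroom_mbps:.0f} Mbit/s, latency ok: {latency_ok}")
```

Under that assumption the backup carries 146 Mbit/s at most, about 81% of capacity, leaving 34 Mbit/s for overhead and RF modulation changes.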

Lessons for Engineering Practice

Route diversity must be proven at the layer where the failure occurs. Carrier diversity, VLAN diversity, router diversity, and logical topology diversity do not prove physical diversity. A backhoe, flood, bridge fire, building-entry failure, or patch-room error follows geography and process, not the network diagram.

Useful review questions are:

  1. Do redundant circuits share ducts, bridges, poles, trays, risers, patch rooms, power, or maintenance procedures?
  2. Does the backup path have enough capacity for the protected service, not the full normal load?
  3. Are traffic classes enforced before queueing destroys latency and jitter?
  4. Are alarms correlated by site impact and shared-risk group?
  5. Does the restoration report distinguish restored traffic from restored resilience?
  6. Can future engineers find the evidence without reconstructing the incident?

Transferable Takeaways

The case transfers to data centers, industrial plants, emergency networks, cellular backhaul, ports, campuses, mines, hospitals, and transportation systems. The same pattern appears whenever a service is declared redundant from a logical diagram while physical dependencies remain hidden.

A strong telecommunications design does not merely add backup links. It makes failure domains visible, sizes degraded service intentionally, tests failover under load, and leaves records that operations can trust during the next incident.
