
Packet QoS Misclassification and Queue Starvation Case Study


This case study analyzes a packet-network failure where quality-of-service markings were trusted at the wrong boundary. A video stream entered the strict-priority queue, consumed most of a degraded microwave backhaul link, and starved a telemetry class that had been correctly engineered on paper.

The case is representative of real telecommunications service-assurance failures: the physical link was still up, the nominal topology was redundant, and average utilization looked explainable, but class counters and delay percentiles showed that the service had lost its traffic separation.

System Context

A remote operations site connects to a control center through a packet backhaul network. The primary fiber path normally carries the traffic. A microwave path provides backup during fiber maintenance and restoration events. The backup path uses adaptive modulation, so available service capacity can fall during rain fade.

The site carries four traffic groups:

| Traffic group | Intended class | Engineering requirement |
| --- | --- | --- |
| control telemetry | assured low-latency class | one-way p95 latency below 25\ \text{ms} |
| operational voice | strict-priority class | low jitter, low packet loss |
| management traffic | controlled class | reachable during faults |
| surveillance video and file transfer | best-effort class | may be shaped or dropped during degraded operation |

The intended degraded-mode policy is:

| Class | Intended degraded service |
| --- | --- |
| strict-priority voice | capped at 2\ \text{Mbit/s} |
| assured telemetry | minimum 5\ \text{Mbit/s} |
| management | minimum 0.5\ \text{Mbit/s} |
| best effort | remaining capacity only |
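A quick way to sanity-check a policy like this is to add up its commitments and compare them with the link capacity. The sketch below is illustrative only: the dictionary layout and function names are assumptions, and the two capacities are the 20 and 10 Mbit/s backup figures reported for the incident.

```python
# Degraded-mode policy from the table above, in Mbit/s.
# The dictionary layout and names are illustrative assumptions.
DEGRADED_POLICY = {
    "strict_priority_voice": {"cap": 2.0},
    "assured_telemetry": {"minimum": 5.0},
    "management": {"minimum": 0.5},
}


def committed_rate(policy):
    """Sum of the priority cap and all class minimums, in Mbit/s."""
    return sum(c.get("cap", 0.0) + c.get("minimum", 0.0) for c in policy.values())


def best_effort_share(policy, link_capacity):
    """Capacity left for best effort once every commitment is honored.

    A negative result means the policy cannot be honored at this capacity.
    """
    return link_capacity - committed_rate(policy)


for capacity in (20.0, 10.0):
    print(capacity, best_effort_share(DEGRADED_POLICY, capacity))
```

Under these figures the commitments total 7.5 Mbit/s, so the policy as designed still fits inside the degraded link with a small best-effort remainder.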

The service-design review assumed that surveillance video would remain best effort. That assumption was not validated at the ingress boundary.

Incident Symptoms

During a planned fiber outage, traffic moved onto the microwave backup. Light rain reduced the microwave service capacity from 20\ \text{Mbit/s} to 10\ \text{Mbit/s} through adaptive modulation and coding fallback.

Operations observed:

  1. telemetry update gaps of several seconds;
  2. p95 telemetry latency spikes above 800\ \text{ms};
  3. intermittent voice distortion;
  4. packet drops in the assured telemetry queue;
  5. no complete link-down alarm;
  6. high strict-priority queue occupancy during the event.

The first misleading interpretation was “rain fade reduced bandwidth.” That was true, but incomplete. The root cause was not only reduced capacity. It was reduced capacity combined with a QoS trust-boundary error.

Evidence Collected

Packet captures and queue counters showed:

| Evidence item | Observation |
| --- | --- |
| video gateway marking | surveillance stream sent with DSCP 46 |
| access switch policy | customer DSCP markings were trusted |
| microwave egress policy | DSCP 46 mapped to strict-priority queue |
| priority policer | absent in the deployed backup template |
| telemetry marking | telemetry packets remained correctly marked for assured service |
| queue counter trend | strict-priority queue dominated the egress scheduler |
| telemetry class | correctly marked packets were served only from the residual capacity left by strict priority |

The service had two simultaneous errors:

  1. the ingress boundary trusted a non-critical source marking;
  2. the backup egress template did not cap strict-priority traffic.

The result was starvation of the assured class when the microwave link fell to degraded capacity.
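The chain of decisions in the evidence table can be traced with a small model of the deployed behavior. This is a sketch, not the device configuration: the function names are hypothetical, and the telemetry code point (34) is a placeholder because the case study does not state which DSCP the assured class used.

```python
EF = 46                 # DSCP 46, as observed on the surveillance stream
ASSUMED_TELEMETRY = 34  # placeholder; the case study does not give this value


def deployed_ingress(received_dscp):
    """Deployed access-switch behavior: the received marking is trusted as-is."""
    return received_dscp


def deployed_egress_queue(dscp):
    """Deployed microwave egress map; strict priority has no policer."""
    if dscp == EF:
        return "strict-priority"
    if dscp == ASSUMED_TELEMETRY:
        return "assured"
    return "best-effort"


# Marking each flow carries when it reaches the access switch.
flows = {
    "operational voice": EF,
    "surveillance video": EF,  # set by the video gateway, not by the network
    "control telemetry": ASSUMED_TELEMETRY,
}

placement = {
    name: deployed_egress_queue(deployed_ingress(dscp))
    for name, dscp in flows.items()
}
for name, queue in placement.items():
    print(f"{name}: {queue}")
```

The model shows video and voice sharing the strict-priority queue while telemetry, although correctly marked, sits in a lower-precedence queue.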

Capacity Check Before the Fault

Under normal backup capacity, the microwave path provided:

R_{normal}=20\ \text{Mbit/s}

The misclassified surveillance stream and real voice traffic entering strict priority were:

A_{prio}=8.5+0.8=9.3\ \text{Mbit/s}

Residual capacity after strict priority was:

R_{res,normal}=20-9.3=10.7\ \text{Mbit/s}

Telemetry demand was:

A_{tel}=3.2\ \text{Mbit/s}

With 10.7\ \text{Mbit/s} residual capacity, telemetry still had enough apparent capacity, so the fault did not always show under normal conditions.
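The same check can be written as a short calculation. The sketch below assumes an uncapped strict-priority queue that is always served first; the function name is hypothetical, and the headroom figure ignores management and best-effort traffic that also competes for the residual capacity.

```python
def residual_capacity(link_rate, priority_load):
    """Capacity left for non-priority classes when strict priority is uncapped.

    All rates are in Mbit/s. Strict priority can never take more than the link.
    """
    return max(link_rate - min(priority_load, link_rate), 0.0)


priority_load = 8.5 + 0.8   # misclassified video plus real voice
telemetry_demand = 3.2

residual = residual_capacity(20.0, priority_load)
headroom = residual - telemetry_demand
print(round(residual, 1), round(headroom, 1))
```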

Engineering Comment

This is why QoS misconfiguration can remain hidden. A class may be wrong but harmless while spare capacity exists. The defect appears when a failure state, maintenance state, adaptive-modulation fallback, or traffic burst removes the spare capacity.

Capacity Check During Degraded Microwave Operation

During rain fade, available capacity fell to:

R_{degraded}=10\ \text{Mbit/s}

Because the priority queue was uncapped, the same priority load consumed:

A_{prio}=9.3\ \text{Mbit/s}

Residual capacity was:

R_{res,degraded}=10-9.3=0.7\ \text{Mbit/s}

The telemetry class needed:

A_{tel}=3.2\ \text{Mbit/s}

The effective utilization for telemetry using residual capacity was:

\displaystyle \rho_{tel}=\frac{A_{tel}}{R_{res,degraded}}=\frac{3.2}{0.7}=4.57

Since:

\rho_{tel}>1

the telemetry queue was unstable: arrivals exceeded the available service rate, so backlog and delay grew until the buffer filled and packets were dropped.
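The stability test generalizes to any capacity the adaptive-modulation link might offer. The following sketch reuses the figures above; the function name is an assumption, and the break-even capacity is a derived value under the same simple model rather than a measured one.

```python
PRIORITY_LOAD = 9.3      # Mbit/s entering strict priority
TELEMETRY_DEMAND = 3.2   # Mbit/s offered by telemetry


def telemetry_utilization(link_rate):
    """Telemetry demand divided by the capacity left after strict priority."""
    residual = link_rate - PRIORITY_LOAD
    if residual <= 0:
        return float("inf")  # nothing left for telemetry at all
    return TELEMETRY_DEMAND / residual


rho_normal = telemetry_utilization(20.0)
rho_degraded = telemetry_utilization(10.0)

# Below this capacity the telemetry queue cannot be stable in this model.
break_even_capacity = PRIORITY_LOAD + TELEMETRY_DEMAND

print(round(rho_normal, 2), round(rho_degraded, 2), round(break_even_capacity, 1))
```

In this model, telemetry becomes unstable once the link drops below about 12.5 Mbit/s, well before the 10 Mbit/s state seen in the incident.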

Engineering Comment

The queueing result is decisive. This is not a marginal tuning problem; the protected telemetry class received less service rate than its arrival rate. No amount of average latency reporting can make an unstable queue acceptable.

Buffer Overflow Time

Assume the telemetry queue had 2\ \text{MB} of effective buffer available before drop behavior became visible:

B=2\ \text{MB}=16\ \text{Mbit}

Queue growth rate was approximately:

G_q=A_{tel}-R_{res,degraded}=3.2-0.7=2.5\ \text{Mbit/s}

Time to fill the buffer was:

\displaystyle t_{fill}=\frac{B}{G_q}=\frac{16}{2.5}=6.4\ \text{s}

This matches the event logs: telemetry updates degraded after a few seconds of sustained video traffic during backup operation, not immediately when the path switched.
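The fill time can also be reproduced by stepping a simple fluid queue forward in time, which is a useful cross-check when arrival or service rates are not constant. This is a minimal sketch with a hypothetical function name; it treats traffic as a continuous flow and ignores packet boundaries.

```python
def time_to_overflow(arrival_mbps, service_mbps, buffer_mbit, step_s=0.01):
    """Step a fluid queue forward until the backlog reaches the buffer size.

    Returns the elapsed time in seconds, or None if the queue never grows.
    """
    growth = arrival_mbps - service_mbps
    if growth <= 0:
        return None
    backlog = 0.0
    steps = 0
    while backlog < buffer_mbit:
        backlog += growth * step_s
        steps += 1
    return steps * step_s


t_fill = time_to_overflow(arrival_mbps=3.2, service_mbps=0.7, buffer_mbit=16.0)
print(t_fill)
```

The stepped result agrees with the closed-form 6.4 s to within one time step.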

Engineering Comment

The delay was not mysterious. A finite buffer exposed the mistake after a predictable time. This is a useful diagnostic pattern: if delay grows over seconds and then drops appear, look for a queue being fed faster than it drains.

Intended Corrected Queue Screen

The corrected degraded policy caps strict priority at:

R_{prio,cap}=2\ \text{Mbit/s}

and reserves telemetry service:

R_{tel}=5\ \text{Mbit/s}

For 600\ \text{byte} telemetry packets:

L=600(8)=4800\ \text{bit}

Telemetry service rate is:

\displaystyle \mu=\frac{5\times10^6}{4800}=1042\ \text{packet/s}

Telemetry arrival rate is:

\displaystyle \lambda=\frac{3.2\times10^6}{4800}=667\ \text{packet/s}

Utilization becomes:

\displaystyle \rho=\frac{667}{1042}=0.64

Using an M/M/1 screening model, average queueing delay is:

\displaystyle W_q=\frac{\rho}{\mu(1-\rho)}
\displaystyle W_q=\frac{0.64}{1042(1-0.64)}=0.00171\ \text{s}=1.71\ \text{ms}

The p95 queueing-time screen is:

\displaystyle W_{q,95}=\frac{\ln(\rho/0.05)}{\mu-\lambda}
\displaystyle W_{q,95}=\frac{\ln(0.64/0.05)}{1042-667}=0.00680\ \text{s}=6.80\ \text{ms}

If fixed propagation, serialization, and forwarding delay add approximately 2.0\ \text{ms}, the p95 telemetry latency screen is:

t_{95}\approx 2.0+6.8=8.8\ \text{ms}

This is below the 25\ \text{ms} requirement.
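The screen above can be recomputed for other packet sizes or reserved rates. The sketch below assumes the same M/M/1 model (Poisson arrivals, exponential service times), in which the waiting-time tail is P(W_q > t) = rho * exp(-(mu - lambda) t); the function name and return layout are illustrative.

```python
import math


def mm1_latency_screen(arrival_bps, service_bps, packet_bytes,
                       fixed_delay_s, percentile=0.95):
    """Return mean and percentile queueing delay for an M/M/1 screening model."""
    bits = packet_bytes * 8
    lam = arrival_bps / bits   # packets per second arriving
    mu = service_bps / bits    # packets per second served
    rho = lam / mu
    if rho >= 1.0:
        raise ValueError("unstable queue: arrival rate meets or exceeds service rate")
    wq_mean = rho / (mu * (1.0 - rho))
    tail = 1.0 - percentile
    # If rho <= tail, the requested percentile of packets does not wait at all.
    wq_pct = math.log(rho / tail) / (mu - lam) if rho > tail else 0.0
    return {
        "rho": rho,
        "wq_mean_s": wq_mean,
        "wq_pct_s": wq_pct,
        "latency_pct_s": fixed_delay_s + wq_pct,
    }


screen = mm1_latency_screen(3.2e6, 5e6, 600, fixed_delay_s=0.002)
print({k: round(v, 5) for k, v in screen.items()})
```

Because real telemetry traffic is rarely Poisson, the result is best read as a screening estimate to be confirmed by measurement, not as a guarantee.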

Engineering Comment

The corrected design is not only “mark packets correctly.” It also caps priority traffic. A strict-priority queue without a policer can protect one class by damaging every other class during contention.

Root Cause

The root cause was a QoS trust-boundary failure combined with an incomplete backup template.

The immediate technical causes were:

  1. a surveillance video gateway emitted DSCP 46, normally reserved for expedited forwarding traffic;
  2. the access switch trusted that marking instead of remarking video to best effort;
  3. the backup microwave egress policy mapped DSCP 46 to strict priority;
  4. the strict-priority policer present in the primary template was missing from the backup template;
  5. acceptance testing did not include degraded microwave capacity with realistic video traffic.

The organizational cause was a handover gap. The team validated that the microwave link could carry packets, but did not validate class treatment at every boundary in the degraded state.

Corrective Actions

The corrective design used four controls.

1. Define the Trust Boundary

Only network-controlled devices may set trusted service markings. Customer, camera, maintenance laptop, and unmanaged device markings are reset at ingress.

The corrected ingress rule is:

| Source traffic | Ingress action |
| --- | --- |
| operational voice from approved phones | preserve or mark strict-priority DSCP |
| control telemetry from approved ports | mark assured telemetry DSCP |
| management from approved subnet | mark management class |
| surveillance video | remark to best effort |
| unknown traffic | remark to best effort or drop by policy |
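The ingress rule can be expressed as a lookup keyed on the source, not on the received marking. The sketch below is an illustration under stated assumptions: the source labels are hypothetical, the telemetry (34) and management (16) code points are placeholders that the case study does not specify, and unknown traffic is remarked rather than dropped.

```python
EF = 46                       # strict-priority marking
BEST_EFFORT = 0
ASSUMED_TELEMETRY_DSCP = 34   # placeholder; not stated in the case study
ASSUMED_MANAGEMENT_DSCP = 16  # placeholder; not stated in the case study

# Marking is decided by who the source is, never by what it claims.
INGRESS_RULES = {
    "approved_phone": EF,
    "approved_telemetry_port": ASSUMED_TELEMETRY_DSCP,
    "approved_mgmt_subnet": ASSUMED_MANAGEMENT_DSCP,
    "surveillance_video": BEST_EFFORT,
}


def remark(source_type, received_dscp):
    """Return the DSCP applied at ingress.

    received_dscp is deliberately ignored: the boundary does not trust it.
    Unknown sources fall through to best effort.
    """
    return INGRESS_RULES.get(source_type, BEST_EFFORT)


print(remark("surveillance_video", 46))   # camera asks for EF, gets best effort
print(remark("maintenance_laptop", 46))   # unknown source, gets best effort
print(remark("approved_phone", 0))        # approved phone is marked EF
```

Ignoring the received marking entirely is the simplest way to make the trust boundary explicit; a real deployment may instead preserve markings from approved phones, as the table allows.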

2. Police Strict Priority

Strict priority is capped:

R_{prio,cap}=2\ \text{Mbit/s}

Any traffic above the cap is remarked, shaped, or dropped according to the service policy. This protects non-priority classes from starvation.
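A policer of this kind is commonly modeled as a token bucket. The sketch below is a generic single-rate token bucket, not the vendor implementation; the 15,000-byte burst allowance, the 1200-byte packet size, and the class name are assumptions chosen for illustration.

```python
class TokenBucketPolicer:
    """Single-rate token bucket: tokens are bits, refilled at rate_bps."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate_bps = rate_bps
        self.capacity_bits = burst_bytes * 8
        self.tokens = float(self.capacity_bits)  # bucket starts full
        self.last_s = 0.0

    def conforms(self, now_s, packet_bytes):
        """Return True if the packet fits the cap; callers drop or remark otherwise."""
        refill = (now_s - self.last_s) * self.rate_bps
        self.tokens = min(self.capacity_bits, self.tokens + refill)
        self.last_s = now_s
        needed = packet_bytes * 8
        if needed <= self.tokens:
            self.tokens -= needed
            return True
        return False


policer = TokenBucketPolicer(rate_bps=2_000_000, burst_bytes=15_000)
PACKET_BYTES = 1200
PACKETS_PER_S = 969  # about 9.3 Mbit/s of 1200-byte packets

# Offer one second of the incident's priority load and count what conforms.
accepted_bits = sum(
    PACKET_BYTES * 8
    for i in range(PACKETS_PER_S)
    if policer.conforms(i / PACKETS_PER_S, PACKET_BYTES)
)
print(accepted_bits / 1e6)
```

Offered the incident's 9.3 Mbit/s priority load, the bucket passes roughly the 2 Mbit/s cap plus the initial burst allowance, and everything else is left for the drop or remark action.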

3. Reserve Telemetry During Degraded Operation

Telemetry receives a minimum service rate during backup:

R_{tel,min}=5\ \text{Mbit/s}

The policy is tested at 10\ \text{Mbit/s} microwave capacity, not only at the nominal 20\ \text{Mbit/s} state.

4. Monitor by Class

Alarms are tied to class-specific counters:

| Signal | Watch condition |
| --- | --- |
| strict-priority utilization | sustained above 70\% of cap |
| strict-priority drops | any during normal operation |
| telemetry p95 latency | above 15\ \text{ms} |
| telemetry p95 jitter | above 4\ \text{ms} |
| telemetry queue depth | sustained growth for more than 10\ \text{s} |
| DSCP 46 arriving from best-effort sources | any occurrence |
| degraded microwave capacity | automatic QoS policy confirmation required |
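The watch conditions map naturally onto a table of predicates evaluated against each counter sample. The sketch below is a simplified illustration: the field names are hypothetical, "sustained" conditions are assumed to be pre-aggregated by the collector, and the policy-confirmation row is omitted because it is an action rather than a threshold.

```python
# Each entry pairs an alarm name with a test on one counter sample.
# Field names are hypothetical; real counters depend on the platform.
WATCH_CONDITIONS = [
    ("strict-priority utilization above 70% of cap",
     lambda s: s["prio_mbps"] > 0.70 * s["prio_cap_mbps"]),
    ("strict-priority drops",
     lambda s: s["prio_drops"] > 0),
    ("telemetry p95 latency above 15 ms",
     lambda s: s["tel_p95_latency_ms"] > 15.0),
    ("telemetry p95 jitter above 4 ms",
     lambda s: s["tel_p95_jitter_ms"] > 4.0),
    ("telemetry queue growing for more than 10 s",
     lambda s: s["tel_queue_growth_s"] > 10.0),
    ("DSCP 46 from best-effort source",
     lambda s: s["untrusted_ef_packets"] > 0),
]


def raised_alarms(sample):
    """Return the names of all watch conditions the sample violates."""
    return [name for name, test in WATCH_CONDITIONS if test(sample)]


healthy = {
    "prio_mbps": 0.8, "prio_cap_mbps": 2.0, "prio_drops": 0,
    "tel_p95_latency_ms": 8.8, "tel_p95_jitter_ms": 1.5,
    "tel_queue_growth_s": 0.0, "untrusted_ef_packets": 0,
}
camera_marking_ef = dict(healthy, untrusted_ef_packets=120)

print(raised_alarms(healthy))
print(raised_alarms(camera_marking_ef))
```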

Validation Test

The corrected system was validated with an injected test profile:

| Test traffic | Rate |
| --- | --- |
| operational voice | 0.8\ \text{Mbit/s} |
| telemetry | 3.2\ \text{Mbit/s} |
| surveillance video | 8.5\ \text{Mbit/s} |
| management | 0.4\ \text{Mbit/s} |
| best-effort file transfer | burst to available capacity |

The microwave service was forced to the degraded 10\ \text{Mbit/s} state for the test window.

Acceptance evidence required:

  1. packet captures proving ingress remarking;
  2. class counters showing video in best effort, not strict priority;
  3. strict-priority policer counters showing the cap active;
  4. telemetry p95 latency below 25\ \text{ms};
  5. telemetry packet loss below the service threshold;
  6. voice quality probes within jitter limits;
  7. best-effort drops accepted during contention;
  8. monitoring alarms tested and acknowledged by operations.

The corrected policy passed because critical classes were protected by both marking rules and capacity limits. Best-effort traffic degraded first, as intended.
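The expected outcome of the test can be estimated before running it. The sketch below is a deliberately simplified allocation model, not a scheduler simulation: it serves capped priority first, then each guaranteed class up to its minimum, and gives best effort whatever remains. The function name is hypothetical, and the file-transfer burst is left out because its rate is not fixed.

```python
def allocate(capacity, offered, prio_cap, minimums):
    """Simplified per-class allocation in Mbit/s under the corrected policy."""
    grant = {}
    grant["voice"] = min(offered["voice"], prio_cap, capacity)
    remaining = capacity - grant["voice"]
    for name in ("telemetry", "management"):
        # A guaranteed class is modeled as taking at most its reserved minimum.
        grant[name] = min(offered[name], minimums[name], remaining)
        remaining -= grant[name]
    grant["best_effort"] = min(offered["best_effort"], remaining)
    return grant


offered = {
    "voice": 0.8,
    "telemetry": 3.2,
    "management": 0.4,
    "best_effort": 8.5,  # surveillance video after ingress remarking
}
grant = allocate(
    capacity=10.0,
    offered=offered,
    prio_cap=2.0,
    minimums={"telemetry": 5.0, "management": 0.5},
)
video_shortfall = offered["best_effort"] - grant["best_effort"]
print({k: round(v, 1) for k, v in grant.items()}, round(video_shortfall, 1))
```

Under these assumptions voice, telemetry, and management are served in full at 10 Mbit/s, while the remarked video absorbs the shortfall of roughly 2.9 Mbit/s, which is the behavior the acceptance evidence is meant to confirm.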

Final Decision

The engineering decision was:

Restore service only after ingress remarking, strict-priority policing, degraded-capacity QoS validation, and class-specific monitoring evidence are in place.

The main lesson is that QoS is not a label on a packet. It is an end-to-end treatment contract. A service remains at risk until classification, marking, policing, scheduling, degraded capacity, and monitoring are validated together.
