Packet QoS Misclassification and Queue Starvation Case Study
This case study analyzes a packet-network failure where quality-of-service markings were trusted at the wrong boundary. A video stream entered the strict-priority queue, consumed most of a degraded microwave backhaul link, and starved a telemetry class that had been correctly engineered on paper.
The case is representative of real telecommunications service-assurance failures: the physical link was still up, the nominal topology was redundant, and average utilization looked explainable, but class counters and delay percentiles showed that the service had lost its traffic separation.
System Context
A remote operations site connects to a control center through a packet backhaul network. The primary fiber path normally carries the traffic. A microwave path provides backup during fiber maintenance and restoration events. The backup path uses adaptive modulation, so available service capacity can fall during rain fade.
The site carries four traffic groups:
| Traffic group | Intended class | Engineering requirement |
|---|---|---|
| control telemetry | assured low-latency class | one-way p95 latency below $25\ \text{ms}$ |
| operational voice | strict-priority class | low jitter, low packet loss |
| management traffic | controlled class | reachable during faults |
| surveillance video and file transfer | best-effort class | may be shaped or dropped during degraded operation |
The intended degraded-mode policy is:
| Class | Intended degraded service |
|---|---|
| strict-priority voice | capped at $2\ \text{Mbit/s}$ |
| assured telemetry | minimum $5\ \text{Mbit/s}$ |
| management | minimum $0.5\ \text{Mbit/s}$ |
| best effort | remaining capacity only |
The service-design review assumed that surveillance video would remain best effort. That assumption was not validated at the ingress boundary.
Incident Symptoms
During a planned fiber outage, traffic moved onto the microwave backup. Light rain reduced the microwave service capacity from $20\ \text{Mbit/s}$ to $10\ \text{Mbit/s}$ through adaptive modulation and coding fallback.
Operations observed:
- telemetry update gaps of several seconds;
- p95 telemetry latency spikes above $800\ \text{ms}$;
- intermittent voice distortion;
- packet drops in the assured telemetry queue;
- no complete link-down alarm;
- high strict-priority queue occupancy during the event.
The first misleading interpretation was “rain fade reduced bandwidth.” That was true, but incomplete. The root cause was not only reduced capacity. It was reduced capacity combined with a QoS trust-boundary error.
Evidence Collected
Packet captures and queue counters showed:
| Evidence item | Observation |
|---|---|
| video gateway marking | surveillance stream sent with DSCP 46 |
| access switch policy | customer DSCP markings were trusted |
| microwave egress policy | DSCP 46 mapped to strict-priority queue |
| priority policer | absent in the deployed backup template |
| telemetry marking | telemetry packets remained correctly marked for assured service |
| queue counter trend | strict-priority queue dominated the egress scheduler |
| telemetry class | correctly marked packets queued for the small residual capacity left after strict priority |
The service had two simultaneous errors:
- the ingress boundary trusted a non-critical source marking;
- the backup egress template did not cap strict-priority traffic.
The result was starvation of the assured class when the microwave link fell to degraded capacity.
Capacity Check Before the Fault
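Under normal backup capacity, the microwave path provided:

$$C_{\text{normal}} = 20\ \text{Mbit/s}$$

The misclassified surveillance stream and real voice traffic entering strict priority were:

$$R_{\text{SP}} = R_{\text{video}} + R_{\text{voice}} = 8.5 + 0.8 = 9.3\ \text{Mbit/s}$$

Residual capacity after strict priority was:

$$C_{\text{res}} = C_{\text{normal}} - R_{\text{SP}} = 20 - 9.3 = 10.7\ \text{Mbit/s}$$

Telemetry demand was:

$$R_{\text{tel}} = 3.2\ \text{Mbit/s}$$

With $10.7\ \text{Mbit/s}$ residual capacity, telemetry utilization was only about $3.2 / 10.7 \approx 0.30$, so telemetry still had enough apparent capacity and the fault did not always show under normal conditions.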
Engineering Comment
This is why QoS misconfiguration can remain hidden. A class may be wrong but harmless while spare capacity exists. The defect appears when a failure state, maintenance state, adaptive-modulation fallback, or traffic burst removes the spare capacity.
Capacity Check During Degraded Microwave Operation
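During rain fade, available capacity fell to:

$$C_{\text{deg}} = 10\ \text{Mbit/s}$$

Because the priority queue was uncapped, the same priority load ($8.5\ \text{Mbit/s}$ of video plus $0.8\ \text{Mbit/s}$ of voice) consumed:

$$R_{\text{SP}} = 9.3\ \text{Mbit/s}$$

Residual capacity was:

$$C_{\text{res,deg}} = C_{\text{deg}} - R_{\text{SP}} = 10 - 9.3 = 0.7\ \text{Mbit/s}$$

The telemetry class needed:

$$R_{\text{tel}} = 3.2\ \text{Mbit/s}$$

The effective utilization for telemetry using residual capacity was:

$$\rho_{\text{tel}} = \frac{R_{\text{tel}}}{C_{\text{res,deg}}} = \frac{3.2}{0.7} \approx 4.57$$

Since:

$$\rho_{\text{tel}} > 1$$

the telemetry queue was unstable. Delay would grow until buffers overflowed or packets were dropped.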
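The two capacity checks can be reproduced with a short script. This is a minimal sketch, not the operator's tooling: the function names are hypothetical, and the per-class rates (8.5 Mbit/s video, 0.8 Mbit/s voice, 3.2 Mbit/s telemetry) are assumed to match the rates listed in the validation test profile.

```python
def residual_capacity(link_mbps, priority_loads_mbps):
    """Capacity left for lower classes when strict priority is uncapped and served first."""
    return max(link_mbps - sum(priority_loads_mbps), 0.0)


def utilization(demand_mbps, service_mbps):
    """Offered load divided by available service rate; infinite if nothing is left."""
    if service_mbps <= 0:
        return float("inf")
    return demand_mbps / service_mbps


# Assumed rates: misclassified video plus real voice in strict priority.
PRIORITY_LOADS_MBPS = [8.5, 0.8]
TELEMETRY_MBPS = 3.2

for state, link_mbps in [("normal backup", 20.0), ("rain fade", 10.0)]:
    residual = residual_capacity(link_mbps, PRIORITY_LOADS_MBPS)
    rho = utilization(TELEMETRY_MBPS, residual)
    verdict = "stable" if rho < 1 else "unstable"
    print(f"{state}: residual={residual:.1f} Mbit/s, rho={rho:.2f} ({verdict})")
```

Running the same check at every capacity step of the adaptive-modulation table, not only the nominal one, is what would have exposed the defect before the outage.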
Engineering Comment
The queueing result is decisive. This is not a marginal tuning problem; the protected telemetry class received less service rate than its arrival rate. No amount of average latency reporting can make an unstable queue acceptable.
Buffer Overflow Time
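Assume the telemetry queue had $2\ \text{MB}$ of effective buffer available before drop behavior became visible:

$$B = 2\ \text{MB} = 2 \times 10^{6} \times 8\ \text{bit} = 16\ \text{Mbit}$$

Queue growth rate was approximately the telemetry arrival rate ($3.2\ \text{Mbit/s}$) minus the residual service rate ($0.7\ \text{Mbit/s}$):

$$\Delta R = 3.2 - 0.7 = 2.5\ \text{Mbit/s}$$

Time to fill the buffer was:

$$t_{\text{fill}} = \frac{B}{\Delta R} = \frac{16}{2.5} = 6.4\ \text{s}$$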
This matches the event logs: telemetry updates degraded after a few seconds of sustained video traffic during backup operation, not immediately when the path switched.
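The buffer estimate is a simple fluid calculation and can be sketched as follows. The function name is hypothetical, and the 3.2 Mbit/s arrival rate and 0.7 Mbit/s residual service rate are assumptions taken from the degraded capacity check (10 Mbit/s link minus 9.3 Mbit/s of strict-priority load).

```python
def seconds_to_overflow(buffer_bytes, arrival_mbps, service_mbps):
    """Fluid estimate of the time until a queue with net inflow fills its buffer.

    Returns None when the queue drains at least as fast as it fills.
    """
    growth_mbps = arrival_mbps - service_mbps
    if growth_mbps <= 0:
        return None
    return (buffer_bytes * 8) / (growth_mbps * 1e6)


# 2 MB buffer, telemetry arriving at 3.2 Mbit/s, served at 0.7 Mbit/s.
fill_s = seconds_to_overflow(2_000_000, 3.2, 0.7)
print(f"buffer fills after about {fill_s:.1f} s")
```

The estimate ignores burstiness, so it is a screening value: a measured onset of drops within a few seconds of this figure supports the starvation diagnosis.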
Engineering Comment
The delay was not mysterious. A finite buffer exposed the mistake after a predictable time. This is a useful diagnostic pattern: if delay grows over seconds and then drops appear, look for a queue being fed faster than it drains.
Intended Corrected Queue Screen
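The corrected degraded policy caps strict priority at:

$$R_{\text{SP,max}} = 2\ \text{Mbit/s}$$

and reserves telemetry service:

$$R_{\text{tel,min}} = 5\ \text{Mbit/s}$$

For $600\ \text{byte}$ telemetry packets:

$$L = 600 \times 8 = 4800\ \text{bit}$$

Telemetry service rate is:

$$\mu = \frac{R_{\text{tel,min}}}{L} = \frac{5 \times 10^{6}}{4800} \approx 1041.7\ \text{packets/s}$$

Telemetry arrival rate, for $3.2\ \text{Mbit/s}$ of telemetry demand, is:

$$\lambda = \frac{3.2 \times 10^{6}}{4800} \approx 666.7\ \text{packets/s}$$

Utilization becomes:

$$\rho = \frac{\lambda}{\mu} = \frac{3.2}{5} = 0.64$$

Using an M/M/1 screening model, average queueing delay (taken here as mean time in the telemetry queue, waiting plus transmission) is:

$$W = \frac{1}{\mu - \lambda} = \frac{1}{375} \approx 2.67\ \text{ms}$$

The p95 queueing-time screen is:

$$W_{95} = \frac{\ln 20}{\mu - \lambda} \approx \frac{3.00}{375} \approx 8.0\ \text{ms}$$

If fixed propagation, serialization, and forwarding delay add approximately $2.0\ \text{ms}$, the p95 telemetry latency screen is:

$$T_{95} \approx 8.0 + 2.0 = 10.0\ \text{ms}$$

This is below the $25\ \text{ms}$ requirement.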
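The screen can be scripted so that it is rerun whenever the reserved rate or packet size changes. This is a sketch under stated assumptions: `mm1_screen` is a hypothetical helper, the 3.2 Mbit/s telemetry rate is assumed from the validation profile, and "queueing time" is modeled as M/M/1 time in system, which is exponentially distributed with rate $\mu - \lambda$. Real telemetry traffic is unlikely to be Poisson, so the result is a screening value, not a guarantee.

```python
import math


def mm1_screen(service_bps, arrival_bps, packet_bytes, percentile=0.95):
    """M/M/1 time-in-system screen for one traffic class (mean and percentile)."""
    packet_bits = packet_bytes * 8
    mu = service_bps / packet_bits    # service rate, packets/s
    lam = arrival_bps / packet_bits   # arrival rate, packets/s
    if lam >= mu:
        raise ValueError("unstable queue: arrival rate is not below service rate")
    mean_s = 1.0 / (mu - lam)
    pct_s = -math.log(1.0 - percentile) / (mu - lam)
    return {"rho": lam / mu, "mean_ms": mean_s * 1e3, "pct_ms": pct_s * 1e3}


FIXED_DELAY_MS = 2.0  # propagation, serialization, and forwarding allowance
screen = mm1_screen(service_bps=5e6, arrival_bps=3.2e6, packet_bytes=600)
p95_total_ms = screen["pct_ms"] + FIXED_DELAY_MS
print(f"rho={screen['rho']:.2f}, p95 latency screen={p95_total_ms:.1f} ms")
```

Calling the same function with the 0.7 Mbit/s residual rate from the incident raises `ValueError`, which mirrors the instability result from the degraded capacity check.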
Engineering Comment
The corrected design is not only “mark packets correctly.” It also caps priority traffic. A strict-priority queue without a policer can protect one class by damaging every other class during contention.
Root Cause
The root cause was a QoS trust-boundary failure combined with an incomplete backup template.
The immediate technical causes were:
- a surveillance video gateway emitted DSCP 46, normally reserved for expedited forwarding traffic;
- the access switch trusted that marking instead of remarking video to best effort;
- the backup microwave egress policy mapped DSCP 46 to strict priority;
- the strict-priority policer present in the primary template was missing from the backup template;
- acceptance testing did not include degraded microwave capacity with realistic video traffic.
The organizational cause was a handover gap. The team validated that the microwave link could carry packets, but did not validate class treatment at every boundary in the degraded state.
Corrective Actions
The corrective design used four controls.
1. Define the Trust Boundary
Only network-controlled devices may set trusted service markings. Customer, camera, maintenance laptop, and unmanaged device markings are reset at ingress.
The corrected ingress rule is:
| Source traffic | Ingress action |
|---|---|
| operational voice from approved phones | preserve or mark strict-priority DSCP |
| control telemetry from approved ports | mark assured telemetry DSCP |
| management from approved subnet | mark management class |
| surveillance video | remark to best effort |
| unknown traffic | remark to best effort or drop by policy |
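The ingress rule above amounts to ignoring the received marking and assigning a class from the source role alone. A minimal sketch follows. The role names are hypothetical, DSCP 46 and DSCP 0 are the standard expedited-forwarding and default values, and the telemetry and management code points (AF41 and CS2 here) are assumptions, since the case study does not state which values were deployed.

```python
EF = 46               # expedited forwarding, strict-priority voice
BEST_EFFORT = 0       # default forwarding
TELEMETRY_DSCP = 34   # assumption: AF41 for assured telemetry
MANAGEMENT_DSCP = 16  # assumption: CS2 for management

# Hypothetical source roles, resolved from port, VLAN, or subnet at the access switch.
INGRESS_POLICY = {
    "approved_phone": EF,
    "telemetry_port": TELEMETRY_DSCP,
    "management_subnet": MANAGEMENT_DSCP,
    "surveillance_video": BEST_EFFORT,
}


def remark(source_role, received_dscp):
    """Return the DSCP assigned at ingress.

    The received marking is deliberately ignored: only the source role decides
    the class, and unknown roles fall back to best effort.
    """
    return INGRESS_POLICY.get(source_role, BEST_EFFORT)


# The video gateway from the incident sends DSCP 46 and is reset to best effort.
print(remark("surveillance_video", received_dscp=46))
```

Keeping `received_dscp` in the signature but unused makes the design choice explicit: a device cannot promote itself by marking its own packets.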
2. Police Strict Priority
Strict priority is capped at the degraded-policy limit:

$$R_{\text{SP,max}} = 2\ \text{Mbit/s}$$
Any traffic above the cap is remarked, shaped, or dropped according to the service policy. This protects non-priority classes from starvation.
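One common way to enforce such a cap is a single-rate token-bucket policer. The sketch below is illustrative only: the class name, the 15,000-byte burst allowance, and the 1,200-byte packet size are assumptions, and the deployed policer may use a different algorithm. It replays the incident's 9.3 Mbit/s priority load against a 2 Mbit/s cap.

```python
class TokenBucketPolicer:
    """Single-rate token bucket: packets conform while tokens are available."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate_bytes_per_s = rate_bps / 8
        self.burst_bytes = burst_bytes
        self.tokens = float(burst_bytes)  # bucket starts full
        self.last_time = 0.0

    def conforms(self, now, packet_bytes):
        """Refill for the elapsed time, then admit the packet if it fits."""
        elapsed = now - self.last_time
        self.last_time = now
        self.tokens = min(self.burst_bytes,
                          self.tokens + elapsed * self.rate_bytes_per_s)
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False  # exceed action: remark, shape, or drop per policy


def admitted_mbps(policer, offered_bps, packet_bytes, duration_s):
    """Offer evenly spaced packets and return the admitted rate in Mbit/s."""
    interval_s = packet_bytes * 8 / offered_bps
    admitted_bytes = 0
    for i in range(int(duration_s / interval_s)):
        if policer.conforms(i * interval_s, packet_bytes):
            admitted_bytes += packet_bytes
    return admitted_bytes * 8 / duration_s / 1e6


rate = admitted_mbps(TokenBucketPolicer(2e6, 15_000), 9.3e6, 1200, 10.0)
print(f"admitted strict-priority rate: {rate:.2f} Mbit/s")
```

The admitted rate settles close to the cap regardless of the offered load, which is the property that protects the telemetry reservation during contention.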
3. Reserve Telemetry During Degraded Operation
Telemetry receives a minimum service rate during backup:

$$R_{\text{tel,min}} = 5\ \text{Mbit/s}$$

The policy is tested at $10\ \text{Mbit/s}$ microwave capacity, not only at the nominal $20\ \text{Mbit/s}$ state.
4. Monitor by Class
Alarms are tied to class-specific counters:
| Signal | Watch condition |
|---|---|
| strict-priority utilization | sustained above $70\%$ of cap |
| strict-priority drops | any during normal operation |
| telemetry p95 latency | above $15\ \text{ms}$ |
| telemetry p95 jitter | above $4\ \text{ms}$ |
| telemetry queue depth | sustained growth for more than $10\ \text{s}$ |
| DSCP 46 received from best-effort sources | any occurrence |
| degraded microwave capacity | automatic QoS policy confirmation required |
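The watch conditions can be expressed as a small rule table evaluated against each measurement sample. This is a sketch with hypothetical field names. It assumes the 2 Mbit/s strict-priority cap, treats a single sample above threshold as a trigger (a production rule would add the "sustained" hold time), and leaves out the policy-confirmation step for degraded microwave capacity because that is a workflow action, not a threshold.

```python
SP_CAP_MBPS = 2.0  # strict-priority cap from the degraded policy


def class_alarms(sample):
    """Return the names of the watch conditions triggered by one sample."""
    checks = [
        ("strict-priority utilization", sample["sp_mbps"] > 0.70 * SP_CAP_MBPS),
        ("strict-priority drops", sample["sp_drops"] > 0),
        ("telemetry p95 latency", sample["tel_p95_latency_ms"] > 15.0),
        ("telemetry p95 jitter", sample["tel_p95_jitter_ms"] > 4.0),
        ("telemetry queue growth", sample["tel_queue_growth_s"] > 10.0),
        ("DSCP 46 from best-effort source", sample["be_dscp46_packets"] > 0),
    ]
    return [name for name, triggered in checks if triggered]


# Illustrative sample resembling the incident, not measured data.
incident_like = {
    "sp_mbps": 9.3,
    "sp_drops": 0,
    "tel_p95_latency_ms": 800.0,
    "tel_p95_jitter_ms": 40.0,
    "tel_queue_growth_s": 6.0,
    "be_dscp46_packets": 1200,
}
print(class_alarms(incident_like))
```

Note that the queue-growth rule would not have fired in this sample, because the buffer overflowed in under ten seconds; the latency and marking rules are the ones that catch a fast starvation event.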
Validation Test
The corrected system was validated with an injected test profile:
| Test traffic | Rate |
|---|---|
| operational voice | $0.8\ \text{Mbit/s}$ |
| telemetry | $3.2\ \text{Mbit/s}$ |
| surveillance video | $8.5\ \text{Mbit/s}$ |
| management | $0.4\ \text{Mbit/s}$ |
| best-effort file transfer | burst to available capacity |
The microwave service was forced to the degraded $10\ \text{Mbit/s}$ state for the test window.
Acceptance evidence required:
- packet captures proving ingress remarking;
- class counters showing video in best effort, not strict priority;
- strict-priority policer counters showing the cap active;
- telemetry p95 latency below $25\ \text{ms}$;
- telemetry packet loss below the service threshold;
- voice quality probes within jitter limits;
- best-effort drops accepted during contention;
- monitoring alarms tested and acknowledged by operations.
The corrected policy passed because critical classes were protected by both marking rules and capacity limits. Best-effort traffic degraded first, as intended.
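The expected outcome of the test can be pre-computed with a simple fluid allocation, which gives the pass criteria something concrete to compare against. The model below is a simplification with hypothetical names: strict priority is served first up to its cap, each guaranteed class is served up to its minimum, and best effort receives whatever remains. Video is counted as best effort because the corrected ingress rule remarks it, and the file-transfer burst is omitted since it only adds to best-effort demand.

```python
def degraded_allocation(link_mbps, offered, sp_cap_mbps, guarantees):
    """Fluid allocation: capped strict priority, guaranteed classes, then best effort."""
    served = {}
    remaining = link_mbps
    served["voice"] = min(offered["voice"], sp_cap_mbps, remaining)
    remaining -= served["voice"]
    for cls, floor_mbps in guarantees.items():
        served[cls] = min(offered[cls], floor_mbps, remaining)
        remaining -= served[cls]
    served["best_effort"] = min(offered["best_effort"], remaining)
    return served


offered = {"voice": 0.8, "telemetry": 3.2, "management": 0.4, "best_effort": 8.5}
served = degraded_allocation(
    link_mbps=10.0,
    offered=offered,
    sp_cap_mbps=2.0,
    guarantees={"telemetry": 5.0, "management": 0.5},
)
for cls, mbps in served.items():
    print(f"{cls}: offered {offered[cls]:.1f}, served {mbps:.1f} Mbit/s")
```

Under these assumptions voice, telemetry, and management are served in full, while the remarked video receives only the leftover capacity, which is the intended degradation order.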
Final Decision
The engineering decision was:
Restore service only after ingress remarking, strict-priority policing, degraded-capacity QoS validation, and class-specific monitoring evidence are in place.
The main lesson is that QoS is not a label on a packet. It is an end-to-end treatment contract. A service remains at risk until classification, marking, policing, scheduling, degraded capacity, and monitoring are validated together.