Packet QoS Misclassification and Queue Starvation Case Study

Packet QoS misclassification case study for DSCP trust boundaries, strict-priority queues, jitter, packet loss, corrective actions, and validation evidence.

Branch: Telecommunications Engineering
Content: Case study
Updated: Jun 23, 2026
Revision: v1.0.0 · reviewed

This case study analyzes a packet-network failure where quality-of-service markings were trusted at the wrong boundary. A video stream entered the strict-priority queue, consumed most of a degraded microwave backhaul link, and starved a telemetry class that had been correctly engineered on paper.

The case is representative of real telecommunications service-assurance failures: the physical link was still up, the nominal topology was redundant, and average utilization looked explainable, but class counters and delay percentiles showed that the service had lost its traffic separation.

System Context

A remote operations site connects to a control center through a packet backhaul network. The primary fiber path normally carries the traffic. A microwave path provides backup during fiber maintenance and restoration events. The backup path uses adaptive modulation, so available service capacity can fall during rain fade.

The site carries four traffic groups:

Traffic group	Intended class	Engineering requirement
control telemetry	assured low-latency class	one-way p95 latency below $25\ \text{ms}$
operational voice	strict-priority class	low jitter, low packet loss
management traffic	controlled class	reachable during faults
surveillance video and file transfer	best-effort class	may be shaped or dropped during degraded operation

The intended degraded-mode policy is:

Class	Intended degraded service
strict-priority voice	capped at $2\ \text{Mbit/s}$
assured telemetry	minimum $5\ \text{Mbit/s}$
management	minimum $0.5\ \text{Mbit/s}$
best effort	remaining capacity only

The service-design review assumed that surveillance video would remain best effort. That assumption was not validated at the ingress boundary.

Incident Symptoms

During a planned fiber outage, traffic moved onto the microwave backup. Light rain reduced the microwave service capacity from $20\ \text{Mbit/s}$ to $10\ \text{Mbit/s}$ through adaptive modulation and coding fallback.

Operations observed:

telemetry update gaps of several seconds;
p95 telemetry latency spikes above $800\ \text{ms}$;
intermittent voice distortion;
packet drops in the assured telemetry queue;
no complete link-down alarm;
high strict-priority queue occupancy during the event.

The first misleading interpretation was “rain fade reduced bandwidth.” That was true, but incomplete. Reduced capacity alone would not have starved telemetry. The root cause was reduced capacity combined with a QoS trust-boundary error and an uncapped strict-priority queue.

Evidence Collected

Packet captures and queue counters showed:

Evidence item	Observation
video gateway marking	surveillance stream sent with DSCP $46$
access switch policy	customer DSCP markings were trusted
microwave egress policy	DSCP $46$ mapped to strict-priority queue
priority policer	absent in the deployed backup template
telemetry marking	telemetry packets remained correctly marked for assured service
queue counter trend	strict-priority queue dominated the egress scheduler
telemetry class	correctly marked packets queued for the small residual capacity left after strict priority

The service had two simultaneous errors:

the ingress boundary trusted a non-critical source marking;
the backup egress template did not cap strict-priority traffic.

The result was starvation of the assured class when the microwave link fell to degraded capacity.

Capacity Check Before the Fault

Under normal backup capacity, the microwave path provided:

R_{normal}=20\ \text{Mbit/s}

The misclassified surveillance stream ($8.5\ \text{Mbit/s}$) and genuine voice traffic ($0.8\ \text{Mbit/s}$) together offered the following load to strict priority:

A_{prio}=8.5+0.8=9.3\ \text{Mbit/s}

Residual capacity after strict priority was:

R_{res,normal}=20-9.3=10.7\ \text{Mbit/s}

Telemetry demand was:

A_{tel}=3.2\ \text{Mbit/s}

With $10.7\ \text{Mbit/s}$ of residual capacity against a $3.2\ \text{Mbit/s}$ demand, telemetry still appeared to have enough capacity, so the misclassification stayed hidden while the backup path ran at its nominal rate.
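The arithmetic above can be reproduced with a minimal Python sketch. The rates come from this case study; the function and constant names are illustrative, and the model assumes that uncapped strict priority is served first and everything else shares what is left.

```python
# Rates from the case study, in Mbit/s.
LINK_NORMAL = 20.0
VIDEO_MISCLASSIFIED = 8.5
VOICE = 0.8
TELEMETRY_DEMAND = 3.2


def residual_after_priority(link_rate, priority_load):
    """Capacity left for other classes when strict priority is uncapped."""
    return max(link_rate - priority_load, 0.0)


priority_load = VIDEO_MISCLASSIFIED + VOICE
residual = residual_after_priority(LINK_NORMAL, priority_load)
headroom = residual - TELEMETRY_DEMAND

print(f"priority load:      {priority_load:.1f} Mbit/s")
print(f"residual capacity:  {residual:.1f} Mbit/s")
print(f"telemetry headroom: {headroom:.1f} Mbit/s")
```

A positive headroom only shows that telemetry fits at this link rate. It says nothing about whether the class mapping is correct.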

Engineering Comment

This is why QoS misconfiguration can remain hidden. A class may be wrong but harmless while spare capacity exists. The defect appears when a failure state, maintenance state, adaptive-modulation fallback, or traffic burst removes the spare capacity.

Capacity Check During Degraded Microwave Operation

During rain fade, available capacity fell to:

R_{degraded}=10\ \text{Mbit/s}

Because the priority queue was uncapped, the same priority load consumed:

A_{prio}=9.3\ \text{Mbit/s}

Residual capacity was:

R_{res,degraded}=10-9.3=0.7\ \text{Mbit/s}

The telemetry class needed:

A_{tel}=3.2\ \text{Mbit/s}

The effective utilization for telemetry using residual capacity was:

\displaystyle \rho_{tel}=\frac{A_{tel}}{R_{res,degraded}}=\frac{3.2}{0.7}=4.57

Since:

\rho_{tel}>1

the telemetry queue was unstable: packets arrived faster than they could be served. Delay grew until the buffer overflowed and packets were dropped.
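As a rough cross-check, the same utilization ratio can be evaluated at several link rates. This is a sketch under the same simplification as the text (telemetry gets whatever strict priority leaves behind); the function name and the intermediate $15\ \text{Mbit/s}$ rate are illustrative and not taken from the incident data.

```python
PRIORITY_LOAD = 9.3      # Mbit/s, uncapped strict-priority load from the text
TELEMETRY_DEMAND = 3.2   # Mbit/s


def telemetry_utilization(link_rate, priority_load, telemetry_demand):
    """Offered telemetry load divided by the capacity left after priority."""
    residual = link_rate - priority_load
    if residual <= 0:
        return float("inf")  # nothing left to serve telemetry at all
    return telemetry_demand / residual


for link_rate in (20.0, 15.0, 10.0):
    rho = telemetry_utilization(link_rate, PRIORITY_LOAD, TELEMETRY_DEMAND)
    state = "stable" if rho < 1.0 else "unstable"
    print(f"link {link_rate:4.1f} Mbit/s -> rho_tel = {rho:.2f} ({state})")

# Below this link rate the telemetry queue cannot keep up with its arrivals.
break_even = PRIORITY_LOAD + TELEMETRY_DEMAND
print(f"break-even link rate: {break_even:.1f} Mbit/s")
```

Under these assumptions, the break-even rate of $12.5\ \text{Mbit/s}$ sits between the nominal and degraded microwave capacities, which is why the defect surfaced only during the fade.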

Engineering Comment

The queueing result is decisive. This is not a marginal tuning problem; the protected telemetry class received less service rate than its arrival rate. No amount of average latency reporting can make an unstable queue acceptable.

Buffer Overflow Time

Assume the telemetry queue had $2\ \text{MB}$ of effective buffer available before drop behavior became visible:

B=2\ \text{MB}=16\ \text{Mbit}

Queue growth rate was approximately:

G_q=A_{tel}-R_{res,degraded}=3.2-0.7=2.5\ \text{Mbit/s}

Time to fill the buffer was:

\displaystyle t_{fill}=\frac{B}{G_q}=\frac{16}{2.5}=6.4\ \text{s}

This matches the event logs: telemetry updates degraded after a few seconds of sustained video traffic during backup operation, not immediately when the path switched.
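The fill-time estimate is a one-line fluid approximation, sketched below. It assumes constant arrival and service rates, decimal units (1 MB = 8 Mbit, as in the text), and an initially empty buffer; the function name is illustrative.

```python
def time_to_fill(buffer_mbit, arrival_mbps, service_mbps):
    """Seconds until a buffer overflows, or None if the queue drains."""
    growth = arrival_mbps - service_mbps
    if growth <= 0:
        return None  # service keeps up, so the buffer never fills
    return buffer_mbit / growth


BUFFER_MBYTE = 2.0
buffer_mbit = BUFFER_MBYTE * 8  # decimal conversion used in the case study

t_fill = time_to_fill(buffer_mbit, arrival_mbps=3.2, service_mbps=0.7)
print(f"buffer fills after about {t_fill:.1f} s")
```

The `None` branch matters for the corrected policy: with a $5\ \text{Mbit/s}$ telemetry reservation, the fluid model predicts no sustained growth at all.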

Engineering Comment

The delay was not mysterious. A finite buffer exposed the mistake after a predictable time. This is a useful diagnostic pattern: if delay grows over seconds and then drops appear, look for a queue being fed faster than it drains.

Intended Corrected Queue Screen

The corrected degraded policy caps strict priority at:

R_{prio,cap}=2\ \text{Mbit/s}

and reserves telemetry service:

R_{tel}=5\ \text{Mbit/s}

For $600\ \text{byte}$ telemetry packets:

L=600(8)=4800\ \text{bit}

Telemetry service rate is:

\displaystyle \mu=\frac{5\times10^6}{4800}=1042\ \text{packet/s}

Telemetry arrival rate is:

\displaystyle \lambda=\frac{3.2\times10^6}{4800}=667\ \text{packet/s}

Utilization becomes:

\displaystyle \rho=\frac{667}{1042}=0.64

Using an M/M/1 screening model (Poisson arrivals, exponentially distributed service times, a single server), the average queueing delay is:

\displaystyle W_q=\frac{\rho}{\mu(1-\rho)}

\displaystyle W_q=\frac{0.64}{1042(1-0.64)}=0.00171\ \text{s}=1.71\ \text{ms}

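For M/M/1, the queueing-time tail is $P(W_q>t)=\rho e^{-(\mu-\lambda)t}$. Setting this tail probability to $0.05$ and solving for $t$ gives the p95 queueing-time screen: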

\displaystyle W_{q,95}=\frac{\ln(\rho/0.05)}{\mu-\lambda}

\displaystyle W_{q,95}=\frac{\ln(0.64/0.05)}{1042-667}=0.00680\ \text{s}=6.80\ \text{ms}

If fixed propagation, serialization, and forwarding delay add approximately $2.0\ \text{ms}$, the p95 telemetry latency screen is:

t_{95}\approx 2.0+6.8=8.8\ \text{ms}

This is below the $25\ \text{ms}$ requirement.
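The screening calculation can be reproduced with a short Python sketch. It uses the same M/M/1 formulas and the same inputs as the text; the function name is illustrative, and M/M/1 is a screening approximation, not a model of the deployed scheduler.

```python
import math


def mm1_queue_screen(service_bps, arrival_bps, packet_bytes, percentile=0.95):
    """Return (rho, mean queueing delay, percentile queueing delay), in seconds."""
    packet_bits = packet_bytes * 8
    mu = service_bps / packet_bits    # service rate, packet/s
    lam = arrival_bps / packet_bits   # arrival rate, packet/s
    rho = lam / mu
    if rho >= 1.0:
        raise ValueError("queue is unstable: arrival rate >= service rate")
    mean_wq = rho / (mu - lam)
    tail = 1.0 - percentile
    # P(Wq > t) = rho * exp(-(mu - lam) * t); if rho <= tail the percentile is 0.
    pct_wq = 0.0 if rho <= tail else math.log(rho / tail) / (mu - lam)
    return rho, mean_wq, pct_wq


FIXED_DELAY_MS = 2.0  # propagation + serialization + forwarding, from the text

rho, mean_wq, p95_wq = mm1_queue_screen(5e6, 3.2e6, 600)
p95_total_ms = FIXED_DELAY_MS + p95_wq * 1000

print(f"rho = {rho:.2f}")
print(f"mean queueing delay = {mean_wq * 1000:.2f} ms")
print(f"p95 queueing delay  = {p95_wq * 1000:.2f} ms")
print(f"p95 latency screen  = {p95_total_ms:.1f} ms")
```

Calling the function with the degraded residual rate ($0.7\ \text{Mbit/s}$) instead of the reservation raises the instability error, which mirrors the $\rho_{tel}>1$ result from the incident analysis.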

Engineering Comment

The corrected design is not only “mark packets correctly.” It also caps priority traffic. A strict-priority queue without a policer can protect one class by damaging every other class during contention.

Root Cause

The root cause was a QoS trust-boundary failure combined with an incomplete backup template.

The immediate technical causes were:

a surveillance video gateway emitted DSCP $46$, normally reserved for expedited forwarding traffic;
the access switch trusted that marking instead of remarking video to best effort;
the backup microwave egress policy mapped DSCP $46$ to strict priority;
the strict-priority policer present in the primary template was missing from the backup template;
acceptance testing did not include degraded microwave capacity with realistic video traffic.

The organizational cause was a handover gap. The team validated that the microwave link could carry packets, but did not validate class treatment at every boundary in the degraded state.

Corrective Actions

The corrective design used four controls.

1. Define the Trust Boundary

Only network-controlled devices may set trusted service markings. Customer, camera, maintenance laptop, and unmanaged device markings are reset at ingress.

The corrected ingress rule is:

Source traffic	Ingress action
operational voice from approved phones	preserve or mark strict-priority DSCP
control telemetry from approved ports	mark assured telemetry DSCP
management from approved subnet	mark management class
surveillance video	remark to best effort
unknown traffic	remark to best effort or drop by policy
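The ingress rule in the table can be sketched as a small classifier. The source-type strings, class names, and function name are hypothetical labels for illustration, not identifiers from any device configuration. The point is that the class follows the approved source, never the DSCP value the sender offers.

```python
# Hypothetical mapping of approved source types to service classes.
APPROVED_SOURCE_CLASS = {
    "voice": "strict-priority",
    "telemetry": "assured-telemetry",
    "management": "management",
}


def ingress_class(source_type, approved, offered_dscp):
    """Class assigned at the trust boundary.

    offered_dscp is accepted only to show that it is deliberately ignored:
    an untrusted DSCP 46 gets the same treatment as an untrusted DSCP 0.
    """
    if approved and source_type in APPROVED_SOURCE_CLASS:
        return APPROVED_SOURCE_CLASS[source_type]
    return "best-effort"


print(ingress_class("video", approved=False, offered_dscp=46))      # best-effort
print(ingress_class("voice", approved=True, offered_dscp=46))       # strict-priority
print(ingress_class("telemetry", approved=True, offered_dscp=0))    # assured-telemetry
print(ingress_class("voice", approved=False, offered_dscp=46))      # best-effort
```

This sketch remarks unknown traffic to best effort; dropping it instead, as the table allows, would be a policy decision outside the sketch.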

2. Police Strict Priority

Strict priority is capped:

R_{prio,cap}=2\ \text{Mbit/s}

Any traffic above the cap is remarked, shaped, or dropped according to the service policy. This protects non-priority classes from starvation.
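One common way to enforce such a cap is a single-rate token-bucket policer. The sketch below is an assumption about mechanism, not the deployed configuration: the packet size, burst allowance, and all names are hypothetical, and out-of-profile packets are simply counted as rejected.

```python
class TokenBucketPolicer:
    """Single-rate token bucket; packets without enough tokens do not conform."""

    def __init__(self, rate_bps, burst_bits):
        self.rate_bps = rate_bps
        self.burst_bits = burst_bits
        self.tokens = burst_bits  # start with a full bucket
        self.last_time = 0.0

    def conforms(self, now, packet_bits):
        # Refill for the elapsed time, but never beyond the burst size.
        elapsed = now - self.last_time
        self.last_time = now
        self.tokens = min(self.burst_bits, self.tokens + elapsed * self.rate_bps)
        if packet_bits <= self.tokens:
            self.tokens -= packet_bits
            return True
        return False


def policed_rate_mbps(offered_bps, cap_bps, packet_bits=9600,
                      burst_packets=10, duration_s=10.0):
    """Conforming throughput for evenly spaced packets at the offered rate."""
    policer = TokenBucketPolicer(cap_bps, burst_packets * packet_bits)
    interval = packet_bits / offered_bps
    passed_bits = 0
    for i in range(int(duration_s / interval)):
        if policer.conforms(i * interval, packet_bits):
            passed_bits += packet_bits
    return passed_bits / duration_s / 1e6


passed = policed_rate_mbps(offered_bps=9.3e6, cap_bps=2.0e6)
print(f"offered 9.3 Mbit/s, conforming about {passed:.2f} Mbit/s")
```

With ingress remarking in place, only genuine voice reaches this policer, so the cap acts as a second line of defence rather than the primary control.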

3. Reserve Telemetry During Degraded Operation

Telemetry receives a minimum service rate during backup:

R_{tel,min}=5\ \text{Mbit/s}

The policy is tested at $10\ \text{Mbit/s}$ microwave capacity, not only at the nominal $20\ \text{Mbit/s}$ state.
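A quick feasibility check shows why the degraded state is the one worth testing. The sketch sums the priority cap and the class minimums from the intended degraded policy and compares them with each microwave rate; the function name is illustrative.

```python
PRIORITY_CAP = 2.0                                  # Mbit/s
MINIMUMS = {"telemetry": 5.0, "management": 0.5}    # Mbit/s


def reservation_headroom(link_rate, priority_cap, minimums):
    """Capacity left for best effort after the cap and all minimums."""
    committed = priority_cap + sum(minimums.values())
    return link_rate - committed


for link_rate in (20.0, 10.0):
    headroom = reservation_headroom(link_rate, PRIORITY_CAP, MINIMUMS)
    verdict = "feasible" if headroom >= 0 else "over-committed"
    print(f"link {link_rate:4.1f} Mbit/s: {headroom:4.1f} Mbit/s for best effort ({verdict})")
```

The committed total is $7.5\ \text{Mbit/s}$, so the policy still fits at $10\ \text{Mbit/s}$ but would be over-committed if adaptive modulation fell below that total.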

4. Monitor by Class

Alarms are tied to class-specific counters:

Signal	Watch condition
strict-priority utilization	sustained above $70\%$ of cap
strict-priority drops	any during normal operation
telemetry p95 latency	above $15\ \text{ms}$
telemetry p95 jitter	above $4\ \text{ms}$
telemetry queue depth	sustained growth for more than $10\ \text{s}$
DSCP $46$ marking attempts from best-effort sources	any occurrence
degraded microwave capacity	automatic QoS policy confirmation required
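The threshold rows of the watch table can be expressed as a simple evaluation function. The field names and sample values below are invented for illustration, and each sample is assumed to be already averaged over whatever window the operator treats as "sustained". The last table row is a confirmation workflow, not a threshold, so it is left out.

```python
from dataclasses import dataclass

PRIORITY_CAP_MBPS = 2.0


@dataclass
class ClassSample:
    priority_rate_mbps: float
    priority_drops: int
    telemetry_p95_latency_ms: float
    telemetry_p95_jitter_ms: float
    telemetry_queue_growth_s: float   # seconds of continuous queue growth
    untrusted_dscp46_packets: int


def raised_alarms(sample):
    """Return the names of the watch conditions that the sample violates."""
    alarms = []
    if sample.priority_rate_mbps > 0.70 * PRIORITY_CAP_MBPS:
        alarms.append("strict-priority utilization")
    if sample.priority_drops > 0:
        alarms.append("strict-priority drops")
    if sample.telemetry_p95_latency_ms > 15.0:
        alarms.append("telemetry p95 latency")
    if sample.telemetry_p95_jitter_ms > 4.0:
        alarms.append("telemetry p95 jitter")
    if sample.telemetry_queue_growth_s > 10.0:
        alarms.append("telemetry queue depth")
    if sample.untrusted_dscp46_packets > 0:
        alarms.append("untrusted DSCP 46")
    return alarms


healthy = ClassSample(0.8, 0, 8.8, 1.0, 0.0, 0)
faulty = ClassSample(1.9, 0, 800.0, 1.0, 12.0, 5)   # illustrative values only

print(raised_alarms(healthy))
print(raised_alarms(faulty))
```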

Validation Test

The corrected system was validated with an injected test profile:

Test traffic	Rate
operational voice	$0.8\ \text{Mbit/s}$
telemetry	$3.2\ \text{Mbit/s}$
surveillance video	$8.5\ \text{Mbit/s}$
management	$0.4\ \text{Mbit/s}$
best-effort file transfer	burst to available capacity

The microwave service was forced to the degraded $10\ \text{Mbit/s}$ state for the test window.
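Before running the test, the expected per-class service under the corrected policy can be estimated with a simplified allocation sketch. It serves capped strict priority first, then each guaranteed class up to the smaller of its demand and its minimum, then gives the rest to best effort. Real schedulers also share spare capacity among guaranteed classes; that is ignored here, and all names are illustrative.

```python
def allocate(link_rate, priority_demand, priority_cap, guaranteed, best_effort_demand):
    """Per-class service in Mbit/s.

    guaranteed maps class name -> (demand, minimum rate).
    """
    alloc = {"strict-priority": min(priority_demand, priority_cap, link_rate)}
    remaining = link_rate - alloc["strict-priority"]
    for name, (demand, minimum) in guaranteed.items():
        share = min(demand, minimum, remaining)
        alloc[name] = share
        remaining -= share
    alloc["best-effort"] = min(best_effort_demand, remaining)
    return alloc


# Test profile from the table above; video is now remarked to best effort.
result = allocate(
    link_rate=10.0,
    priority_demand=0.8,                       # operational voice only
    priority_cap=2.0,
    guaranteed={"telemetry": (3.2, 5.0), "management": (0.4, 0.5)},
    best_effort_demand=8.5,                    # surveillance video
)

for name, rate in result.items():
    print(f"{name:16s} {rate:4.1f} Mbit/s")
```

Under these assumptions, voice, telemetry, and management receive their full offered rates at $10\ \text{Mbit/s}$, and video is held to about $5.6\ \text{Mbit/s}$ of its $8.5\ \text{Mbit/s}$ demand, so the loss lands on best effort as the policy intends.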

Acceptance evidence required:

packet captures proving ingress remarking;
class counters showing video in best effort, not strict priority;
strict-priority policer counters showing the cap active;
telemetry p95 latency below $25\ \text{ms}$;
telemetry packet loss below the service threshold;
voice quality probes within jitter limits;
best-effort drops accepted during contention;
monitoring alarms tested and acknowledged by operations.

The corrected policy passed because critical classes were protected by both marking rules and capacity limits. Best-effort traffic degraded first, as intended.

Final Decision

The engineering decision was:

Restore service only after ingress remarking, strict-priority policing, degraded-capacity QoS validation, and class-specific monitoring evidence are in place.

The main lesson is that QoS is not a label on a packet. It is an end-to-end treatment contract. A service remains at risk until classification, marking, policing, scheduling, degraded capacity, and monitoring are validated together.
