Exercise set

Diagnostic System Validation, Threshold, and Workflow Exercises

These exercises practise validation and workflow decisions for diagnostic systems linked to biomedical imaging and measurement. They cover sensitivity, specificity, predictive values, thresholds, false-positive workload, confidence intervals, subgroup gaps, measurement uncertainty, throughput, latency, reader agreement, prevalence shift, model update evidence, monitoring triggers and release gates.

The focus is narrower than general medical-device validation. A diagnostic system must prove that a locked score, threshold, dataset, reference standard, deployment prevalence and clinical workflow can support the intended decision without creating unacceptable misses, false positives or operational burden.

How to Use These Exercises

For each problem, define:

  1. the diagnostic task and intended-use population;
  2. the reference standard and dataset boundary;
  3. the locked model, score, threshold or reader rule;
  4. the workflow constraint: queue, false-positive workup, latency, confirmatory test or escalation;
  5. the monitoring evidence required after deployment.

The common mistake is reporting sensitivity and specificity as if they completely define a diagnostic device. In deployment, prevalence, subgroup mix, threshold, workflow capacity, false-positive burden, reference quality and monitoring drift can dominate the release decision.

Release Evidence Notes

Diagnostic-performance evidence should match deployment. A curated validation set may not represent the prevalence, acquisition variability, user workflow or subgroup distribution encountered in service.

Threshold evidence should make trade-offs visible. Lower thresholds can improve sensitivity while overloading clinical workup with false positives. Higher thresholds can reduce workload while missing intended positives.

Workflow evidence should be quantitative. A diagnostic system that is statistically good but creates more cases than the clinical pathway can review safely is not release-ready.

Post-deployment monitoring should have triggers. Drift in prevalence, score distribution, false-positive rate, turnaround time, reader disagreement or QA phantom metrics should trigger review before performance erodes silently.

Scenario Map

| Scenario | Exercises | Primary calculation | Engineering decision |
|---|---|---|---|
| Validation metrics | 1-6 | sensitivity, specificity, confidence interval and subgroup gap | Decide whether dataset evidence supports the claim. |
| Threshold and prevalence | 7-10, 15 | predictive values, Youden index, threshold workload and prevalence shift | Decide whether the chosen threshold fits deployment. |
| Workflow and monitoring | 11-18 | queue capacity, latency, reader agreement, model update sample, monitoring trigger and release gate | Decide whether the diagnostic system can operate safely. |

Exercise 1: Sensitivity and Specificity

A validation dataset has:

| Outcome | Count |
|---|---|
| true positives | 90 |
| false negatives | 10 |
| true negatives | 180 |
| false positives | 20 |

Calculate sensitivity and specificity.

Solution

Sensitivity:

\displaystyle Se=\frac{TP}{TP+FN}=\frac{90}{90+10}=0.900

Specificity:

\displaystyle Sp=\frac{TN}{TN+FP}=\frac{180}{180+20}=0.900

Both are:

90.0\%

Engineering Comment

The equal percentages are convenient but incomplete. Confidence intervals, subgroup behavior, prevalence and workflow burden still determine release.

Plausibility Check

Each denominator is clean: 100 positives and 200 negatives. The rates are exactly 90\%.
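
The calculation above can be sketched in a few lines of Python. The function name `sensitivity_specificity` is an illustrative assumption, not part of any library; the counts are the ones given in the exercise.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    positives = tp + fn  # reference-positive cases
    negatives = tn + fp  # reference-negative cases
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative reference case")
    return tp / positives, tn / negatives


se, sp = sensitivity_specificity(tp=90, fn=10, tn=180, fp=20)
print(f"Se = {se:.3f}, Sp = {sp:.3f}")
```

Guarding against empty denominators matters in subgroup analysis, where a stratum can contain no reference-positive cases.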

Exercise 2: Predictive Values at Deployment Prevalence

Use:

Se=0.90,\quad Sp=0.90

Deployment prevalence is:

p=0.08

Calculate PPV and NPV.

Solution

Positive predictive value:

\displaystyle PPV=\frac{Se p}{Se p+(1-Sp)(1-p)}
\displaystyle PPV=\frac{0.90(0.08)}{0.90(0.08)+0.10(0.92)}=0.439

Negative predictive value:

\displaystyle NPV=\frac{Sp(1-p)}{(1-Se)p+Sp(1-p)}
\displaystyle NPV=\frac{0.90(0.92)}{0.10(0.08)+0.90(0.92)}=0.990

Engineering Comment

At low prevalence, false positives can outnumber true positives even with high specificity. PPV is a deployment metric, not only a model metric.

Plausibility Check

Only 8\% of cases are positive, so PPV below 50\% is plausible while NPV is very high.
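
A minimal Python sketch of the two formulas above, assuming sensitivity and specificity transfer unchanged from validation to deployment. The function name `predictive_values` is hypothetical.

```python
def predictive_values(se, sp, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity and prevalence."""
    tp = se * prevalence              # true-positive fraction of all cases
    fp = (1 - sp) * (1 - prevalence)  # false-positive fraction
    fn = (1 - se) * prevalence        # false-negative fraction
    tn = sp * (1 - prevalence)        # true-negative fraction
    return tp / (tp + fp), tn / (tn + fn)


ppv, npv = predictive_values(se=0.90, sp=0.90, prevalence=0.08)
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")
```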

Exercise 3: False-Positive Workload per 1000 Cases

For the deployment in Exercise 2, calculate expected false-positive workups per:

1000

screened cases.

Solution

Negative cases:

N_-=(1-p)(1000)=0.92(1000)=920

False positives:

FP=(1-Sp)N_-=0.10(920)=92

Engineering Comment

Ninety-two workups per 1000 screened cases may be acceptable or excessive depending on staffing, patient risk, confirmatory testing and communication pathway.

Plausibility Check

Ten percent of about nine hundred negative cases should be about ninety false positives.
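
The same bookkeeping can be written as a short Python sketch that returns all four expected counts, which makes it easy to compare false positives with true positives. The function name `expected_counts` is an illustrative assumption.

```python
def expected_counts(n_screened, prevalence, se, sp):
    """Expected confusion-matrix counts for a screened population."""
    positives = n_screened * prevalence
    negatives = n_screened - positives
    return {
        "true_positives": se * positives,
        "false_negatives": (1 - se) * positives,
        "false_positives": (1 - sp) * negatives,
        "true_negatives": sp * negatives,
    }


counts = expected_counts(n_screened=1000, prevalence=0.08, se=0.90, sp=0.90)
for name, value in counts.items():
    print(f"{name}: {value:.0f}")
```

With these inputs the expected false positives (92) exceed the expected true positives (72), which is why the PPV in Exercise 2 is below 50\%.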

Exercise 4: Threshold Selection by Youden Index

Three thresholds have validation performance:

| Threshold | Sensitivity | Specificity |
|---|---|---|
| 0.35 | 0.96 | 0.80 |
| 0.50 | 0.90 | 0.90 |
| 0.65 | 0.78 | 0.95 |

Calculate Youden index:

J=Se+Sp-1

Solution

For threshold 0.35:

J=0.96+0.80-1=0.76

For threshold 0.50:

J=0.90+0.90-1=0.80

For threshold 0.65:

J=0.78+0.95-1=0.73

The highest Youden index is at threshold:

0.50

Engineering Comment

Youden index is useful but not sufficient. It does not include prevalence, false-positive workload, false-negative severity or workflow capacity.

Plausibility Check

The balanced threshold has the highest combined sensitivity and specificity. The result fits the table.
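
A short Python sketch of the threshold comparison, using the three rows from the exercise table. The names `youden_index` and `OPERATING_POINTS` are assumptions made for illustration.

```python
# (threshold, sensitivity, specificity) rows from the exercise table
OPERATING_POINTS = [
    (0.35, 0.96, 0.80),
    (0.50, 0.90, 0.90),
    (0.65, 0.78, 0.95),
]


def youden_index(se, sp):
    """Youden index J = Se + Sp - 1."""
    return se + sp - 1


for threshold, se, sp in OPERATING_POINTS:
    print(f"threshold {threshold:.2f}: J = {youden_index(se, sp):.2f}")

best_threshold, _, _ = max(OPERATING_POINTS, key=lambda row: youden_index(row[1], row[2]))
print(f"highest J at threshold {best_threshold:.2f}")
```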

Exercise 5: Sensitivity Confidence Interval Screen

A validation set has:

TP=90,\quad FN=10

Use the simple standard error approximation:

\displaystyle SE_p=\sqrt{\frac{\hat p(1-\hat p)}{n}}

and approximate 95\% half-width:

h=1.96SE_p

Calculate half-width for sensitivity.

Solution

Sensitivity estimate:

\displaystyle \hat p=\frac{90}{100}=0.90

Standard error:

\displaystyle SE_p=\sqrt{\frac{0.90(0.10)}{100}}=0.030

Half-width:

h=1.96(0.030)=0.0588

Therefore the approximate half-width is:

5.9\ \text{percentage points}

Engineering Comment

Point estimates without uncertainty can overstate confidence. A release claim may need a lower-bound requirement, not only a nominal rate.

Plausibility Check

One hundred positive cases gives a confidence band of several percentage points. The result is plausible.
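
The half-width can be reproduced in Python. The sketch below also computes a Wilson score interval for comparison; the Wilson interval is not part of the exercise and is included only as an assumption-labelled alternative that tends to behave better than the simple approximation when the proportion is close to 0 or 1. Function names are hypothetical.

```python
import math

Z95 = 1.96  # two-sided 95% normal quantile, as used in the exercise


def wald_half_width(successes, n, z=Z95):
    """Half-width from the simple standard-error approximation."""
    p = successes / n
    return z * math.sqrt(p * (1 - p) / n)


def wilson_interval(successes, n, z=Z95):
    """Wilson score interval (comparison only, not used in the exercise)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


h = wald_half_width(90, 100)
lower, upper = wilson_interval(90, 100)
print(f"simple half-width: {h:.4f}")
print(f"simple interval:   {0.90 - h:.3f} to {0.90 + h:.3f}")
print(f"Wilson interval:   {lower:.3f} to {upper:.3f}")
```

If the release claim is written as a lower-bound requirement, the choice of interval method should be fixed in the validation plan before the data are unblinded.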

Exercise 6: Subgroup Sensitivity Gap

Overall sensitivity is:

Se_{all}=0.90

A subgroup sensitivity is:

Se_{sub}=0.78

The allowed subgroup gap is:

0.08

Check the gap.

Solution

Gap:

G=Se_{all}-Se_{sub}=0.90-0.78=0.12

Since:

0.12>0.08

the subgroup gap fails.

Engineering Comment

Overall validation can hide subgroup failure. The release package should identify whether the subgroup is within intended use, needs warning, retraining, data expansion or exclusion.

Plausibility Check

The subgroup result is visibly lower than the overall result by twelve percentage points, so it exceeds an eight-point limit.

Exercise 7: Measurement Uncertainty for Vessel Diameter

An imaging tool measures vessel diameter:

d=3.20\ \text{mm}

Independent standard uncertainties are:

| Source | Standard uncertainty |
|---|---|
| pixel calibration | 0.04\ \text{mm} |
| segmentation repeatability | 0.07\ \text{mm} |
| motion residual | 0.05\ \text{mm} |

Calculate combined standard uncertainty.

Solution

RSS uncertainty:

u_c=\sqrt{0.04^2+0.07^2+0.05^2}
u_c=\sqrt{0.0016+0.0049+0.0025}=0.0949\ \text{mm}

Engineering Comment

If clinical decisions depend on small diameter changes, measurement uncertainty should be compared with the decision threshold, not only reported after the fact.

Plausibility Check

The largest component is 0.07\ \text{mm} and RSS of three small terms should be just under 0.1\ \text{mm}.
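
A minimal Python sketch of the root-sum-square combination, valid only under the exercise's assumption that the sources are independent. The function name `combined_standard_uncertainty` is illustrative.

```python
import math


def combined_standard_uncertainty(components_mm):
    """Root-sum-square of independent standard uncertainties (same unit)."""
    return math.sqrt(sum(u**2 for u in components_mm))


budget_mm = {
    "pixel calibration": 0.04,
    "segmentation repeatability": 0.07,
    "motion residual": 0.05,
}
u_c = combined_standard_uncertainty(budget_mm.values())
print(f"u_c = {u_c:.4f} mm")
```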

Exercise 8: Decision Limit with Guard Band

A treatment decision threshold is:

d_{lim}=3.50\ \text{mm}

Measured diameter is:

d=3.20\ \text{mm}

Use the expanded uncertainty with coverage factor 2, based on the combined standard uncertainty from Exercise 7:

U=2u_c=0.190\ \text{mm}

Check whether the guarded upper value exceeds the decision limit.

Solution

Guarded upper value:

d_g=d+U=3.20+0.190=3.39\ \text{mm}

Since:

3.39<3.50

the guarded measurement remains below the threshold.

Engineering Comment

Guard bands prevent release decisions from depending on nominal measurements that are too close to a clinical threshold.

Plausibility Check

The uncertainty adds less than 0.2\ \text{mm}, so the guarded value remains below 3.5\ \text{mm}.
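
The guard-band check can be sketched in Python as follows. The function names and the second test value (3.35 mm) are hypothetical; the coverage factor of 2 and the other numbers come from the exercise.

```python
def guarded_upper(measured_mm, u_c_mm, k=2.0):
    """Measured value plus expanded uncertainty U = k * u_c."""
    return measured_mm + k * u_c_mm


def below_limit_with_guard(measured_mm, u_c_mm, limit_mm, k=2.0):
    """True if the guarded upper value stays below the decision limit."""
    return guarded_upper(measured_mm, u_c_mm, k) < limit_mm


# u_c = 0.095 mm reproduces the exercise's U = 0.190 mm
print(round(guarded_upper(3.20, 0.095), 2))
print(below_limit_with_guard(3.20, 0.095, 3.50))
# Hypothetical measurement closer to the limit: 3.35 + 0.19 = 3.54 mm
print(below_limit_with_guard(3.35, 0.095, 3.50))
```

With this guard band, any measurement above 3.31 mm can no longer be declared below the 3.50 mm limit on its own.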

Exercise 9: Diagnostic Throughput and Queue Risk

A diagnostic review station can process:

18\ \text{cases/h}

The incoming rate during screening is:

15\ \text{cases/h}

An algorithm adds false-positive workups at:

4\ \text{cases/h}

Check capacity.

Solution

Total demand:

\lambda=15+4=19\ \text{cases/h}

Capacity margin:

M=18-19=-1\ \text{case/h}

The station is overloaded.

Engineering Comment

A model can pass statistical validation and still fail deployment if its false-positive load exceeds review capacity. Workflow is part of diagnostic release.

Plausibility Check

Adding four workups to fifteen cases creates nineteen cases per hour, which is greater than capacity.
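
A rough Python sketch of the capacity check, extended with a deterministic backlog estimate. The 8-hour session length and the function names are assumptions for illustration; the model ignores random variation in arrivals, which in practice makes queues grow even before demand reaches capacity.

```python
def capacity_margin(capacity_per_h, *demand_streams_per_h):
    """Capacity minus total demand, in cases per hour."""
    return capacity_per_h - sum(demand_streams_per_h)


def backlog_after(hours, capacity_per_h, demand_per_h, start_backlog=0.0):
    """Deterministic backlog after a period of constant demand."""
    growth = (demand_per_h - capacity_per_h) * hours
    return max(0.0, start_backlog + growth)


margin = capacity_margin(18, 15, 4)  # screening stream + false-positive workups
backlog = backlog_after(hours=8, capacity_per_h=18, demand_per_h=19)
print(f"margin = {margin} cases/h")
print(f"backlog after 8 h = {backlog:.0f} cases")
```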

Exercise 10: Workload-Limited Threshold Gate

Deployment prevalence is:

p=0.10

Threshold 0.35 has specificity:

Sp=0.80

The false-positive workload limit is:

120

per 1000 screened cases. Check threshold 0.35.

Solution

Negative cases:

N_-=0.90(1000)=900

False positives:

FP=(1-0.80)(900)=180

Since:

180>120

threshold 0.35 fails the workload gate.

Engineering Comment

High sensitivity can be unacceptable if it creates too many false-positive workups for the pathway. The threshold must be released with workflow evidence.

Plausibility Check

Twenty percent false positives among nine hundred negative cases gives one hundred eighty false positives.
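
The gate can be applied to several candidate thresholds at once. The Python sketch below reuses the specificities from the Exercise 4 table at the prevalence of this exercise; combining the two is an assumption made for illustration, as are the function names and the choice to treat a workload exactly at the limit as a pass.

```python
# (threshold, specificity) pairs; 0.35 is from this exercise,
# 0.50 and 0.65 are borrowed from the Exercise 4 table.
THRESHOLD_TABLE = [(0.35, 0.80), (0.50, 0.90), (0.65, 0.95)]


def false_positives_per_1000(sp, prevalence):
    """Expected false positives per 1000 screened cases."""
    return (1 - sp) * (1 - prevalence) * 1000


def workload_gate(table, prevalence, limit_per_1000):
    """Map each threshold to (expected false positives, gate passed)."""
    results = {}
    for threshold, sp in table:
        fp = false_positives_per_1000(sp, prevalence)
        results[threshold] = (round(fp, 1), fp <= limit_per_1000)
    return results


gate = workload_gate(THRESHOLD_TABLE, prevalence=0.10, limit_per_1000=120)
for threshold, (fp, passed) in gate.items():
    print(f"threshold {threshold:.2f}: {fp} FP per 1000, {'pass' if passed else 'fail'}")
```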

Exercise 11: Diagnostic Latency Gate

A diagnostic pipeline includes:

| Stage | Time |
|---|---|
| image transfer | 12\ \text{s} |
| inference | 8\ \text{s} |
| QA check | 20\ \text{s} |
| result routing | 15\ \text{s} |

The intended-use requirement is:

60\ \text{s}

Calculate total latency and margin.

Solution

Total latency:

t=12+8+20+15=55\ \text{s}

Margin:

M=60-55=5\ \text{s}

The latency gate passes narrowly.

Engineering Comment

Latency should be checked under peak load and failure recovery. A five-second margin can disappear with queueing, network delay or manual review.

Plausibility Check

The four stages are all tens of seconds or less, and their sum is below one minute.
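
A small Python sketch of the latency budget. The `extra_delay_s` parameter and the 6 s queueing delay in the second call are hypothetical, added only to show how quickly the margin can be consumed under load.

```python
STAGES_S = {
    "image transfer": 12,
    "inference": 8,
    "QA check": 20,
    "result routing": 15,
}
REQUIREMENT_S = 60


def latency_margin(stages_s, requirement_s, extra_delay_s=0):
    """Requirement minus total stage time and any additional delay."""
    return requirement_s - (sum(stages_s.values()) + extra_delay_s)


print(latency_margin(STAGES_S, REQUIREMENT_S))                   # nominal case
print(latency_margin(STAGES_S, REQUIREMENT_S, extra_delay_s=6))  # hypothetical peak load
```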

Exercise 12: Reader Agreement Kappa

Two readers classify 120 cases. They agree on 96 cases. The agreement expected by chance, estimated from each reader's class frequencies, is:

P_e=0.50

Calculate Cohen’s kappa:

\displaystyle \kappa=\frac{P_o-P_e}{1-P_e}

Solution

Observed agreement:

\displaystyle P_o=\frac{96}{120}=0.80

Kappa:

\displaystyle \kappa=\frac{0.80-0.50}{1-0.50}=0.60

Engineering Comment

Reader agreement affects reference quality and workflow consistency. A diagnostic validation set is weak if labels are unstable or arbitration rules are unclear.

Plausibility Check

Observed agreement is well above expected agreement, so kappa is positive and moderate.
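
In practice the expected agreement is computed from the readers' marginal frequencies rather than given. The Python sketch below does this for a two-class table. The cell counts (48, 12, 12, 48) are hypothetical: they were chosen only because they are consistent with the exercise's 96 agreements in 120 cases and P_e=0.50. The function name is also an assumption.

```python
def cohen_kappa_2x2(both_pos, r1_pos_r2_neg, r1_neg_r2_pos, both_neg):
    """Cohen's kappa for two readers and two classes."""
    n = both_pos + r1_pos_r2_neg + r1_neg_r2_pos + both_neg
    p_o = (both_pos + both_neg) / n            # observed agreement
    r1_pos = (both_pos + r1_pos_r2_neg) / n    # reader 1 positive rate
    r2_pos = (both_pos + r1_neg_r2_pos) / n    # reader 2 positive rate
    p_e = r1_pos * r2_pos + (1 - r1_pos) * (1 - r2_pos)  # chance agreement
    return (p_o - p_e) / (1 - p_e)


# Hypothetical table consistent with the exercise totals
kappa = cohen_kappa_2x2(both_pos=48, r1_pos_r2_neg=12, r1_neg_r2_pos=12, both_neg=48)
print(f"kappa = {kappa:.2f}")
```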

Exercise 13: Calibration Drift and Threshold Bias

A score calibration check shows the deployed system scores:

0.04

higher than validation on average. The locked threshold is:

0.50

Estimate the effective validation-equivalent threshold.

Solution

If deployed scores are biased high, a deployed threshold of 0.50 corresponds to validation score:

T_{eq}=0.50-0.04=0.46

Engineering Comment

Calibration drift can silently lower the effective threshold, increasing sensitivity and false-positive workload. Score calibration should be monitored after deployment.

Plausibility Check

A positive score bias makes cases cross the threshold more easily, so the validation-equivalent threshold should be lower.

Exercise 14: Missed-Case Risk per Screened Population

A deployment screens:

N=5000

cases per month. Prevalence is:

p=0.06

Sensitivity is:

Se=0.92

Estimate expected false negatives per month.

Solution

Positive cases:

N_+=Np=(5000)(0.06)=300

False negatives:

FN=(1-Se)N_+=0.08(300)=24

Engineering Comment

False-negative count communicates clinical impact more directly than sensitivity alone. Risk controls should address the expected missed cases and their severity.

Plausibility Check

Eight percent of three hundred positives is twenty-four, so the result is exact.

Exercise 15: Prevalence Shift PPV Change

A diagnostic threshold has:

Se=0.90,\quad Sp=0.90

Calculate PPV when prevalence increases from:

p_1=0.08

to:

p_2=0.20

Solution

At p_1=0.08:

\displaystyle PPV_1=\frac{0.90(0.08)}{0.90(0.08)+0.10(0.92)}=0.439

At p_2=0.20:

\displaystyle PPV_2=\frac{0.90(0.20)}{0.90(0.20)+0.10(0.80)}=0.692

Engineering Comment

Deployment prevalence changes predictive values even when sensitivity and specificity stay fixed. Monitoring should compare actual case mix with the release basis.

Plausibility Check

Higher prevalence should increase PPV because true positives become more common relative to false positives. The result does.
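
A brief Python sketch that sweeps PPV across prevalence at fixed sensitivity and specificity. The 0.02 entry is a hypothetical extra point added to show the low-prevalence end; the function name is illustrative.

```python
def ppv(se, sp, prevalence):
    """Positive predictive value at a given prevalence."""
    true_pos = se * prevalence
    false_pos = (1 - sp) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# 0.08 and 0.20 are from the exercise; 0.02 is a hypothetical extra point
for p in (0.02, 0.08, 0.20):
    print(f"prevalence {p:.2f}: PPV = {ppv(0.90, 0.90, p):.3f}")
```

A sweep like this is a convenient way to state the prevalence range over which a released PPV claim is expected to hold.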

Exercise 16: Model Update Validation Sample

A model update must demonstrate, with confidence of at least:

95\%

and zero observed failures, that a critical software check catches a known invalid input. Use the rule of three:

\displaystyle n\approx\frac{3}{p_{fail}}

which gives the approximate number of consecutive passing tests needed. If the tolerated miss probability is:

p_{fail}=0.05

estimate required test cases.

Solution

Required tests:

\displaystyle n\approx\frac{3}{0.05}=60

Engineering Comment

This screen does not replace a full validation plan, but it prevents a model update from being released from only a few cherry-picked regression examples.

Plausibility Check

A five percent tolerated miss probability should require tens of zero-failure cases, not just a handful. Sixty is consistent with the rule.
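
The rule of three approximates the exact zero-failure requirement (1-p_{fail})^n \le 0.05. The Python sketch below compares the two, assuming independent test cases with a constant miss probability; the function names are hypothetical.

```python
import math


def rule_of_three(p_fail):
    """Approximate zero-failure sample size for 95% confidence."""
    return 3 / p_fail


def zero_failure_sample_size(p_fail, confidence=0.95):
    """Smallest n with (1 - p_fail)**n <= 1 - confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_fail))


print(f"rule of three: {rule_of_three(0.05):.0f}")
print(f"exact binomial: {zero_failure_sample_size(0.05)}")
```

The approximation is slightly conservative here (60 versus 59), which is the safe direction for a screening rule.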

Exercise 17: Deployment Monitoring Trigger

Baseline false-positive workload is:

92

per 1000 screened cases. The monitoring trigger is a relative increase greater than:

25\%

Current workload is:

121

per 1000 cases. Decide whether the trigger fires.

Solution

Increase:

\Delta=121-92=29

Relative increase:

\displaystyle \frac{29}{92}(100)=31.5\%

The trigger fires.

Engineering Comment

False-positive drift can reflect prevalence shift, acquisition change, calibration drift, threshold mismatch or user workflow change. It should trigger diagnostic review before the queue becomes unsafe.

Plausibility Check

An increase of about thirty on a baseline around ninety is about one third, so it exceeds 25\%.
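
A minimal Python sketch of the trigger rule. The function names are assumptions; the strict "greater than" comparison follows the wording of the exercise.

```python
def relative_increase(baseline, current):
    """Fractional change relative to the baseline."""
    return (current - baseline) / baseline


def trigger_fires(baseline, current, limit=0.25):
    """True if the relative increase is strictly greater than the limit."""
    return relative_increase(baseline, current) > limit


print(f"relative increase: {relative_increase(92, 121):.1%}")
print(f"trigger fires: {trigger_fires(92, 121)}")
print(f"workload at the trigger boundary: {92 * 1.25:.0f} per 1000")
```

Stating the boundary as an absolute workload (115 per 1000 here) can make the trigger easier to read on an operational dashboard.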

Exercise 18: Diagnostic Release Gate

A diagnostic-system release review reports:

| Evidence item | Result |
|---|---|
| sensitivity lower-bound gate | pass |
| specificity lower-bound gate | pass |
| subgroup sensitivity gap | fail |
| threshold workload gate | pass |
| reader agreement | pass |
| latency margin | +5\ \text{s} |
| post-deployment trigger plan | missing |

Decide release status.

Solution

The subgroup gap fails and the post-deployment trigger plan is missing. Therefore:

\text{status}=\text{hold or restricted release}

The threshold and average validation metrics are not enough to release the system for broad deployment.

Engineering Comment

Diagnostic release is not only a dataset score. Subgroup performance and monitoring determine whether the claim remains safe in the intended-use population.

Plausibility Check

The review has both statistical passes and release-critical evidence failures. A restricted or held decision is defensible.
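
The gate logic can be sketched as a simple all-items-must-pass rule in Python. This is one possible encoding, not a prescribed procedure: the function name is hypothetical, and the positive latency margin is recorded as "pass" on the assumption that any non-negative margin satisfies that gate.

```python
EVIDENCE = {
    "sensitivity lower-bound gate": "pass",
    "specificity lower-bound gate": "pass",
    "subgroup sensitivity gap": "fail",
    "threshold workload gate": "pass",
    "reader agreement": "pass",
    "latency margin": "pass",  # +5 s margin, assumed to count as a pass
    "post-deployment trigger plan": "missing",
}


def release_status(evidence):
    """Return (status, blocking items); anything other than 'pass' blocks."""
    blockers = [item for item, result in evidence.items() if result != "pass"]
    status = "release" if not blockers else "hold or restricted release"
    return status, blockers


status, blockers = release_status(EVIDENCE)
print(status)
for item in blockers:
    print(f"  blocking: {item} ({EVIDENCE[item]})")
```

Treating a missing item the same as a failed one keeps absent evidence from being read as acceptable evidence.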

Engineering Boundary Notes

Diagnostic validation is narrower than general device validation here. These exercises focus on score, threshold, prevalence, reference method, diagnostic workup and post-deployment monitoring for imaging and measurement systems.

Sensitivity and specificity are not deployment performance by themselves. Predictive values, workload, subgroup behavior and monitoring triggers determine whether a diagnostic claim can operate safely.

Thresholds should be controlled design inputs. Changing a threshold changes intended use, risk balance, workflow and validation evidence.

Common Release Mistakes

  • reporting sensitivity and specificity without confidence intervals;
  • ignoring intended-use prevalence;
  • selecting a threshold from Youden index alone;
  • accepting high sensitivity while false-positive workload exceeds capacity;
  • ignoring subgroup performance gaps;
  • using a weak or inconsistent reference standard;
  • releasing from curated data without deployment monitoring;
  • treating calibration drift as harmless if average accuracy remains high;
  • evaluating latency without queueing or peak-load data;
  • updating a model without enough regression evidence.

Validation Package Checklist

  • intended-use population stated;
  • reference standard and adjudication method defined;
  • dataset split and case mix documented;
  • locked score and threshold recorded;
  • sensitivity, specificity and uncertainty calculated;
  • PPV and NPV checked at deployment prevalence;
  • false-positive and false-negative workload estimated;
  • subgroup performance reviewed;
  • reader or reference agreement checked where relevant;
  • latency and queue capacity evaluated;
  • update and rollback evidence defined;
  • post-deployment monitoring triggers approved;
  • final decision states normal release, restricted release, hold, retest or monitoring-only use.