Exercise set
Diagnostic System Validation, Threshold, and Workflow Exercises
Worked diagnostic-system exercises for sensitivity, specificity, predictive values, thresholds, false-positive workload, reader agreement, monitoring and release gates.
These exercises practise validation and workflow decisions for diagnostic systems linked to biomedical imaging and measurement. They cover sensitivity, specificity, predictive values, thresholds, false-positive workload, confidence intervals, subgroup gaps, measurement uncertainty, throughput, latency, reader agreement, prevalence shift, model update evidence, monitoring triggers and release gates.
The focus is narrower than general medical-device validation. A diagnostic system must prove that a locked score, threshold, dataset, reference standard, deployment prevalence and clinical workflow can support the intended decision without creating unacceptable misses, false positives or operational burden.
How to Use These Exercises
For each problem, define:
- the diagnostic task and intended-use population;
- the reference standard and dataset boundary;
- the locked model, score, threshold or reader rule;
- the workflow constraint: queue, false-positive workup, latency, confirmatory test or escalation;
- the monitoring evidence required after deployment.
The common mistake is reporting sensitivity and specificity as if they completely define a diagnostic device. In deployment, prevalence, subgroup mix, threshold, workflow capacity, false-positive burden, reference quality and monitoring drift can dominate the release decision.
Release Evidence Notes
Diagnostic-performance evidence should match deployment. A curated validation set may not represent the prevalence, acquisition variability, user workflow or subgroup distribution encountered in service.
Threshold evidence should make trade-offs visible. Lower thresholds can improve sensitivity while overloading clinical workup with false positives. Higher thresholds can reduce workload while missing intended positives.
Workflow evidence should be quantitative. A diagnostic system that is statistically good but creates more cases than the clinical pathway can review safely is not release-ready.
Post-deployment monitoring should have triggers. Drift in prevalence, score distribution, false-positive rate, turnaround time, reader disagreement or QA phantom metrics should trigger review before performance erodes silently.
Scenario Map
| Scenario | Exercises | Primary calculation | Engineering decision |
|---|---|---|---|
| Validation metrics | 1-6 | sensitivity, specificity, confidence interval and subgroup gap | Decide whether dataset evidence supports the claim. |
| Threshold and prevalence | 7-10, 15 | predictive values, Youden index, threshold workload and prevalence shift | Decide whether the chosen threshold fits deployment. |
| Workflow and monitoring | 11-18 | queue capacity, latency, reader agreement, model update sample, monitoring trigger and release gate | Decide whether the diagnostic system can operate safely. |
Exercise 1: Sensitivity and Specificity
A validation dataset has:
| Outcome | Count |
|---|---|
| true positives | 90 |
| false negatives | 10 |
| true negatives | 180 |
| false positives | 20 |
Calculate sensitivity and specificity.
Solution
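Sensitivity:
$$\text{Se} = \frac{TP}{TP + FN} = \frac{90}{90 + 10} = 0.90$$
Specificity:
$$\text{Sp} = \frac{TN}{TN + FP} = \frac{180}{180 + 20} = 0.90$$
Both are:
$$90\%$$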
Engineering Comment
The equal percentages are convenient but incomplete. Confidence intervals, subgroup behavior, prevalence and workflow burden still determine release.
Plausibility Check
Each denominator is clean: 100 positives and 200 negatives. The rates are exactly $90\%$.
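The two rates can be checked with a minimal Python sketch. The function names are illustrative and not part of the exercise; the counts are the ones in the Exercise 1 table.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of reference-positive cases that the system flags."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of reference-negative cases that the system clears."""
    return tn / (tn + fp)


# Counts taken from the Exercise 1 table.
se = sensitivity(tp=90, fn=10)
sp = specificity(tn=180, fp=20)
print(f"sensitivity = {se:.2f}, specificity = {sp:.2f}")
```

Keeping the two denominators in separate functions makes it harder to divide by the total case count by mistake, which would give accuracy rather than sensitivity or specificity.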
Exercise 2: Predictive Values at Deployment Prevalence
Use:
Deployment prevalence is:
$$p = 0.08$$
Calculate PPV and NPV.
Solution
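Positive predictive value:
$$\text{PPV} = \frac{\text{Se}\,p}{\text{Se}\,p + (1 - \text{Sp})(1 - p)}$$
Negative predictive value:
$$\text{NPV} = \frac{\text{Sp}\,(1 - p)}{\text{Sp}\,(1 - p) + (1 - \text{Se})\,p}$$
Here $\text{Se}$ and $\text{Sp}$ are the sensitivity and specificity supplied for this exercise and $p$ is the deployment prevalence.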
Engineering Comment
At low prevalence, false positives can outnumber true positives even with high specificity. PPV is a deployment metric, not only a model metric.
Plausibility Check
Only $8\%$ of cases are positive, so PPV below $50\%$ is plausible while NPV is very high.
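A short Python sketch of the Bayes calculation follows. It assumes the sensitivity and specificity are the 0.90 values from Exercise 1 (an assumption, since the inputs for this exercise are not restated here) and uses the 8% prevalence quoted in the Plausibility Check. The function name is hypothetical.

```python
def predictive_values(se: float, sp: float, prevalence: float) -> tuple[float, float]:
    """Return (PPV, NPV) for a test with the given Se and Sp at one prevalence."""
    tp = se * prevalence                # true-positive fraction of all cases
    fn = (1 - se) * prevalence          # false-negative fraction
    tn = sp * (1 - prevalence)          # true-negative fraction
    fp = (1 - sp) * (1 - prevalence)    # false-positive fraction
    return tp / (tp + fp), tn / (tn + fn)


# Assumed inputs: Se = Sp = 0.90 carried over from Exercise 1, prevalence 0.08.
ppv, npv = predictive_values(se=0.90, sp=0.90, prevalence=0.08)
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

Under these assumed inputs the sketch gives a PPV of about 0.44 and an NPV of about 0.99, which matches the pattern the Plausibility Check describes.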
Exercise 3: False-Positive Workload per 1000 Cases
For the deployment in Exercise 2, calculate expected false-positive workups per 1000 screened cases.
Solution
Negative cases:
$$N_{\text{neg}} = 1000 \times (1 - 0.08) = 920$$
False positives:
$$FP = 920 \times (1 - 0.90) = 92$$
Engineering Comment
Ninety-two workups per 1000 screened cases may be acceptable or excessive depending on staffing, patient risk, confirmatory testing and communication pathway.
Plausibility Check
Ten percent of about nine hundred negative cases should be about ninety false positives.
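The workload arithmetic can be sketched in Python as below. The function name and the default batch size of 1000 are illustrative choices; the specificity of 0.90 and prevalence of 0.08 are the values implied by the Engineering Comment and Plausibility Check above.

```python
def false_positives_per_n(specificity: float, prevalence: float, n: int = 1000) -> float:
    """Expected false-positive workups among n screened cases."""
    negatives = n * (1 - prevalence)        # cases without the condition
    return negatives * (1 - specificity)    # negatives that are flagged anyway


fp_workups = false_positives_per_n(specificity=0.90, prevalence=0.08)
print(f"expected false-positive workups per 1000: {fp_workups:.0f}")
```

The same helper can be reused for other thresholds by passing the specificity at that threshold and the prevalence of the deployment being assessed.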
Exercise 4: Threshold Selection by Youden Index
Three thresholds have validation performance:
| Threshold | Sensitivity | Specificity |
|---|---|---|
| 0.35 | 0.96 | 0.80 |
| 0.50 | 0.90 | 0.90 |
| 0.65 | 0.78 | 0.95 |
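Calculate Youden index:
$$J = \text{Se} + \text{Sp} - 1$$
Solution
For threshold 0.35:
$$J = 0.96 + 0.80 - 1 = 0.76$$
For threshold 0.50:
$$J = 0.90 + 0.90 - 1 = 0.80$$
For threshold 0.65:
$$J = 0.78 + 0.95 - 1 = 0.73$$
The highest Youden index is at threshold:
$$0.50$$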
Engineering Comment
Youden index is useful but not sufficient. It does not include prevalence, false-positive workload, false-negative severity or workflow capacity.
Plausibility Check
The balanced threshold has the highest combined sensitivity and specificity. The result fits the table.
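The threshold comparison can be reproduced with a small Python sketch. The variable and function names are illustrative; the operating points are the rows of the Exercise 4 table.

```python
# (threshold, sensitivity, specificity) rows from the Exercise 4 table.
operating_points = [
    (0.35, 0.96, 0.80),
    (0.50, 0.90, 0.90),
    (0.65, 0.78, 0.95),
]


def youden_index(se: float, sp: float) -> float:
    """Youden J statistic: sensitivity + specificity - 1."""
    return se + sp - 1.0


scores = {t: youden_index(se, sp) for t, se, sp in operating_points}
best_threshold = max(scores, key=scores.get)

for t, j in scores.items():
    print(f"threshold {t:.2f}: J = {j:.2f}")
print(f"highest J at threshold {best_threshold:.2f}")
```

The sketch ranks thresholds on J alone, so it inherits the limitation noted in the Engineering Comment: prevalence, workload and miss severity are not part of the ranking.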
Exercise 5: Sensitivity Confidence Interval Screen
A validation set has:
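Use the simple standard error approximation:
$$SE = \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}}$$
and approximate $95\%$ half-width:
$$h = 1.96 \cdot SE$$
where $\hat{p}$ is the estimated sensitivity and $n$ is the number of reference-positive cases.
Calculate half-width for sensitivity.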
Solution
Sensitivity estimate:
Standard error:
Half-width:
Therefore the approximate half-width is:
Engineering Comment
Point estimates without uncertainty can overstate confidence. A release claim may need a lower-bound requirement, not only a nominal rate.
Plausibility Check
One hundred positive cases gives a confidence band of several percentage points. The result is plausible.
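A minimal Python sketch of this normal-approximation (Wald) screen is shown below. The inputs are illustrative assumptions: 100 reference-positive cases, as the Plausibility Check states, and a sensitivity estimate of 0.90 borrowed from Exercise 1. The function name and the 1.96 multiplier are also assumptions of the sketch.

```python
import math


def wald_half_width(p_hat: float, n: int, z: float = 1.96) -> float:
    """Approximate confidence half-width for a proportion (normal approximation)."""
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    return z * standard_error


# Assumed inputs: sensitivity estimate 0.90 on 100 reference-positive cases.
half_width = wald_half_width(p_hat=0.90, n=100)
print(f"half-width = {half_width:.4f}")
print(f"approximate lower bound = {0.90 - half_width:.4f}")
```

With these assumed inputs the half-width is close to six percentage points. The normal approximation becomes unreliable when the estimate is near 0 or 1 or when the count is small, so a release claim with a lower-bound requirement may call for an exact or score-based interval instead.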
Exercise 6: Subgroup Sensitivity Gap
Overall sensitivity is:
A subgroup sensitivity is:
The allowed subgroup gap is:
$$\Delta_{\max} = 0.08$$
Check the gap.
Solution
Gap:
$$\Delta = \text{Se}_{\text{overall}} - \text{Se}_{\text{subgroup}} = 0.12$$
Since:
$$0.12 > 0.08$$
the subgroup gap fails.
Engineering Comment
Overall validation can hide subgroup failure. The release package should identify whether the subgroup is within intended use, needs warning, retraining, data expansion or exclusion.
Plausibility Check
The subgroup result is visibly lower than the overall result by twelve percentage points, so it exceeds an eight-point limit.
Exercise 7: Measurement Uncertainty for Vessel Diameter
An imaging tool measures vessel diameter:
Independent standard uncertainties are:
| Source | Standard uncertainty |
|---|---|
| pixel calibration | $0.04\ \text{mm}$ |
| segmentation repeatability | $0.07\ \text{mm}$ |
| motion residual | $0.05\ \text{mm}$ |
Calculate combined standard uncertainty.
Solution
RSS uncertainty:
$$u_c = \sqrt{0.04^2 + 0.07^2 + 0.05^2} = \sqrt{0.0090} \approx 0.095\ \text{mm}$$
Engineering Comment
If clinical decisions depend on small diameter changes, measurement uncertainty should be compared with the decision threshold, not only reported after the fact.
Plausibility Check
The largest component is $0.07\ \text{mm}$ and RSS of three small terms should be just under $0.1\ \text{mm}$.
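The root-sum-square combination can be sketched in Python as follows. The function name is illustrative, and the sketch assumes the three sources are independent, as the exercise states.

```python
import math


def combined_standard_uncertainty(components_mm: list[float]) -> float:
    """Root-sum-square of independent standard uncertainties, in mm."""
    return math.sqrt(sum(u ** 2 for u in components_mm))


# Pixel calibration, segmentation repeatability and motion residual, in mm.
u_c = combined_standard_uncertainty([0.04, 0.07, 0.05])
print(f"combined standard uncertainty = {u_c:.4f} mm")
```

If the sources were correlated, a plain RSS would misstate the total, and covariance terms would have to be added to the sum under the square root.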
Exercise 8: Decision Limit with Guard Band
A treatment decision threshold is:
$$d_{\text{limit}} = 3.5\ \text{mm}$$
Measured diameter is:
Use expanded uncertainty:
Check whether the guarded upper value exceeds the decision limit.
Solution
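Guarded upper value (measured diameter $d_{\text{meas}}$ plus expanded uncertainty $U$):
$$d_{\text{upper}} = d_{\text{meas}} + U$$
Since:
$$d_{\text{upper}} < 3.5\ \text{mm}$$
the guarded measurement remains below the threshold.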
Engineering Comment
Guard bands prevent release decisions from depending on nominal measurements that are too close to a clinical threshold.
Plausibility Check
The uncertainty adds less than $0.2\ \text{mm}$, so the guarded value remains below $3.5\ \text{mm}$.
Exercise 9: Diagnostic Throughput and Queue Risk
A diagnostic review station can process:
The incoming rate during screening is:
$$15\ \text{cases/h}$$
An algorithm adds false-positive workups at:
$$4\ \text{cases/h}$$
Check capacity.
Solution
Total demand:
$$15 + 4 = 19\ \text{cases/h}$$
Capacity margin:
The station is overloaded.
Engineering Comment
A model can pass statistical validation and still fail deployment if its false-positive load exceeds review capacity. Workflow is part of diagnostic release.
Plausibility Check
Adding four workups to fifteen cases creates nineteen cases per hour, which is greater than capacity.
Exercise 10: Workload-Limited Threshold Gate
At a prevalence of $p = 0.10$, that is 100 positives per 1000 screened cases, threshold 0.35 has specificity:
$$\text{Sp} = 0.80$$
The false-positive workload limit is:
per 1000 cases. Check threshold 0.35.
Solution
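Negative cases:
$$N_{\text{neg}} = 1000 \times (1 - 0.10) = 900$$
False positives:
$$FP = 900 \times (1 - 0.80) = 180$$
Since 180 false positives per 1000 cases exceeds the workload limit,
threshold 0.35 fails the workload gate.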
Engineering Comment
High sensitivity can be unacceptable if it creates too many false-positive workups for the pathway. The threshold must be released with workflow evidence.
Plausibility Check
Twenty percent false positives among nine hundred negative cases gives one hundred eighty false positives.
Exercise 11: Diagnostic Latency Gate
A diagnostic pipeline includes:
| Stage | Time |
|---|---|
| image transfer | $12\ \text{s}$ |
| inference | $8\ \text{s}$ |
| QA check | $20\ \text{s}$ |
| result routing | $15\ \text{s}$ |
The intended-use requirement is:
$$t_{\text{total}} \le 60\ \text{s}$$
Calculate total latency and margin.
Solution
Total latency:
$$t_{\text{total}} = 12 + 8 + 20 + 15 = 55\ \text{s}$$
Margin:
$$60 - 55 = 5\ \text{s}$$
The latency gate passes narrowly.
Engineering Comment
Latency should be checked under peak load and failure recovery. A five-second margin can disappear with queueing, network delay or manual review.
Plausibility Check
The four stages are all tens of seconds or less, and their sum is below one minute.
Exercise 12: Reader Agreement Kappa
Two readers classify 120 cases. They agree on 96 cases. Expected agreement by class prevalence is estimated as:
Calculate Cohen’s kappa:
$$\kappa = \frac{P_o - P_e}{1 - P_e}$$
Solution
Observed agreement:
$$P_o = \frac{96}{120} = 0.80$$
Kappa:
Engineering Comment
Reader agreement affects reference quality and workflow consistency. A diagnostic validation set is weak if labels are unstable or arbitration rules are unclear.
Plausibility Check
Observed agreement is well above expected agreement, so kappa is positive and moderate.
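The Python sketch below shows how kappa follows from a two-reader agreement table. Only the totals (120 cases, 96 agreements) come from the exercise; the split of agreements into 40 positive and 56 negative, and the 12 and 12 disagreements, are hypothetical values chosen for illustration. The expected agreement is therefore computed from the hypothetical marginals rather than taken from the exercise.

```python
def cohens_kappa(table: list[list[int]]) -> tuple[float, float, float]:
    """Return (P_o, P_e, kappa) for a square agreement table.

    table[i][j] is the number of cases reader A put in class i
    and reader B put in class j.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(table[i]) for i in range(k)]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Chance agreement: product of the two readers' marginal rates per class.
    p_e = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


# Hypothetical split of the 120 cases: 96 on the diagonal, 24 disagreements.
agreement_table = [[40, 12],
                   [12, 56]]
p_o, p_e, kappa = cohens_kappa(agreement_table)
print(f"P_o = {p_o:.2f}, P_e = {p_e:.3f}, kappa = {kappa:.3f}")
```

With this hypothetical table the observed agreement is 0.80 and kappa lands near 0.59. A different class balance would change the expected agreement and therefore kappa, even with the same 96 agreements.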
Exercise 13: Calibration Drift and Threshold Bias
A score calibration check shows the deployed system scores:
higher than validation on average. The locked threshold is:
$$t = 0.50$$
Estimate the effective validation-equivalent threshold.
Solution
If deployed scores are biased high by an average offset $b$, a deployed threshold of 0.50 corresponds to validation score:
$$t_{\text{eff}} = 0.50 - b$$
Engineering Comment
Calibration drift can silently lower the effective threshold, increasing sensitivity and false-positive workload. Score calibration should be monitored after deployment.
Plausibility Check
A positive score bias makes cases cross the threshold more easily, so the validation-equivalent threshold should be lower.
Exercise 14: Missed-Case Risk per Screened Population
A deployment screens:
cases per month. Prevalence is:
Sensitivity is:
$$\text{Se} = 0.92$$
Estimate expected false negatives per month.
Solution
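Positive cases (screened cases times prevalence):
$$N_{\text{pos}} = N_{\text{screened}} \times p = 300$$
False negatives:
$$FN = 300 \times (1 - 0.92) = 24$$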
Engineering Comment
False-negative count communicates clinical impact more directly than sensitivity alone. Risk controls should address the expected missed cases and their severity.
Plausibility Check
Eight percent of three hundred positives is twenty-four, so the result is exact.
Exercise 15: Prevalence Shift PPV Change
A diagnostic threshold has:
Calculate PPV when prevalence increases from:
$$p_1 = 0.08$$
to:
$$p_2 = 0.20$$
Solution
At $p_1 = 0.08$:
At $p_2 = 0.20$:
Engineering Comment
Deployment prevalence changes predictive values even when sensitivity and specificity stay fixed. Monitoring should compare actual case mix with the release basis.
Plausibility Check
Higher prevalence should increase PPV because true positives become more common relative to false positives. The result does.
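The effect of a prevalence shift can also be seen in the odds form of Bayes' rule, sketched below in Python. The sensitivity and specificity of 0.90 are assumptions borrowed from Exercise 1, since the operating point for this exercise is not restated here. The function name is illustrative.

```python
def ppv_from_odds(se: float, sp: float, prevalence: float) -> float:
    """PPV via post-test odds = pre-test odds * positive likelihood ratio."""
    lr_positive = se / (1 - sp)                  # fixed by the operating point
    pre_test_odds = prevalence / (1 - prevalence)
    post_test_odds = pre_test_odds * lr_positive
    return post_test_odds / (1 + post_test_odds)


# Assumed operating point: Se = Sp = 0.90. Prevalences from the exercise.
ppv_low = ppv_from_odds(se=0.90, sp=0.90, prevalence=0.08)
ppv_high = ppv_from_odds(se=0.90, sp=0.90, prevalence=0.20)
print(f"PPV at 0.08 = {ppv_low:.3f}, PPV at 0.20 = {ppv_high:.3f}")
```

The likelihood ratio stays constant when only prevalence moves, so the whole PPV change comes from the pre-test odds. Under the assumed operating point, PPV rises from about 0.44 to about 0.69.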
Exercise 16: Model Update Validation Sample
A model update must demonstrate at least:
$$95\%$$
zero-failure confidence that a critical software check catches a known invalid input. Use the simple rule:
$$n \approx \frac{3}{p}$$
for zero observed failures. If the tolerated miss probability is:
$$p = 0.05$$
estimate required test cases.
Solution
Required tests:
$$n \approx \frac{3}{0.05} = 60$$
Engineering Comment
This screen does not replace a full validation plan, but it prevents a model update from being released from only a few cherry-picked regression examples.
Plausibility Check
A five percent tolerated miss probability should require tens of zero-failure cases, not just a handful. Sixty is consistent with the rule.
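The sixty-case result is consistent with the rule of three, n ≈ 3/p, which approximates a 95% zero-failure bound. The Python sketch below compares that shortcut with the exact zero-failure count under the same assumptions (independent test cases, 95% confidence). The function names are illustrative.

```python
import math


def rule_of_three_cases(p_miss: float) -> int:
    """Zero-failure sample size from the n = 3 / p shortcut."""
    # Round first to absorb floating-point noise before taking the ceiling.
    return math.ceil(round(3 / p_miss, 9))


def exact_zero_failure_cases(p_miss: float, confidence: float = 0.95) -> int:
    """Smallest n with (1 - p_miss) ** n <= 1 - confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_miss))


n_simple = rule_of_three_cases(0.05)
n_exact = exact_zero_failure_cases(0.05)
print(f"rule of three: {n_simple} cases, exact bound: {n_exact} cases")
```

The shortcut is slightly conservative here: it asks for 60 cases where the exact bound needs 59. Both assume every case is an independent trial with zero observed failures, and a single failure invalidates the screen.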
Exercise 17: Deployment Monitoring Trigger
Baseline false-positive workload is:
per 1000 screened cases. The monitoring trigger is a relative increase greater than:
$$25\%$$
Current workload is:
per 1000 cases. Decide whether the trigger fires.
Solution
Increase:
Relative increase:
The trigger fires.
Engineering Comment
False-positive drift can reflect prevalence shift, acquisition change, calibration drift, threshold mismatch or user workflow change. It should trigger diagnostic review before the queue becomes unsafe.
Plausibility Check
An increase of about thirty on a baseline around ninety is about one third, so it exceeds $25\%$.
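A monitoring trigger of this kind can be sketched in Python as below. The inputs are assumptions for illustration only: a baseline of 92 workups per 1000 (carried over from Exercise 3) and a current value of 122, chosen to match the "about thirty" increase in the Plausibility Check. The function name is hypothetical.

```python
def relative_increase_trigger(
    baseline: float, current: float, max_relative_increase: float = 0.25
) -> tuple[float, bool]:
    """Return the relative increase and whether it exceeds the trigger."""
    relative_increase = (current - baseline) / baseline
    return relative_increase, relative_increase > max_relative_increase


# Assumed values: baseline 92 and current 122 false-positive workups per 1000.
rel, fired = relative_increase_trigger(baseline=92, current=122)
print(f"relative increase = {rel:.1%}, trigger fired = {fired}")
```

The comparison is strict, so an increase of exactly 25% does not fire under this sketch. Whether the boundary case should fire is a policy choice that the trigger definition should state.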
Exercise 18: Diagnostic Release Gate
A diagnostic-system release review reports:
| Evidence item | Result |
|---|---|
| sensitivity lower-bound gate | pass |
| specificity lower-bound gate | pass |
| subgroup sensitivity gap | fail |
| threshold workload gate | pass |
| reader agreement | pass |
| latency margin | $+5\ \text{s}$ |
| post-deployment trigger plan | missing |
Decide release status.
Solution
The subgroup gap fails and the post-deployment trigger plan is missing. Therefore the release status is hold or restricted release, not normal release.
The threshold and average validation metrics are not enough to release the system for broad deployment.
Engineering Comment
Diagnostic release is not only a dataset score. Subgroup performance and monitoring determine whether the claim remains safe in the intended-use population.
Plausibility Check
The review has both statistical passes and release-critical evidence failures. A restricted or held decision is defensible.
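The gate logic can be sketched in Python as follows. Treating every non-passing item as a blocker and mapping any blocker to "hold" is an assumption of this sketch, since the exercise also allows a restricted release. The +5 s latency margin is recorded as a pass because Exercise 11 passes the gate narrowly. The names are illustrative.

```python
# Evidence items from the Exercise 18 table; the latency margin is recorded as a pass.
evidence = {
    "sensitivity lower-bound gate": "pass",
    "specificity lower-bound gate": "pass",
    "subgroup sensitivity gap": "fail",
    "threshold workload gate": "pass",
    "reader agreement": "pass",
    "latency margin": "pass",
    "post-deployment trigger plan": "missing",
}


def release_status(items: dict[str, str]) -> tuple[str, list[str]]:
    """Return a status and the list of items that block normal release."""
    blockers = [name for name, result in items.items() if result != "pass"]
    status = "hold" if blockers else "normal release"
    return status, blockers


status, blockers = release_status(evidence)
print(f"status: {status}")
for name in blockers:
    print(f"  blocker: {name}")
```

Listing the blockers alongside the status keeps the decision traceable to specific evidence items, which helps when a hold is later converted to a restricted release.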
Engineering Boundary Notes
Diagnostic validation is narrower than general device validation here. These exercises focus on score, threshold, prevalence, reference method, diagnostic workup and post-deployment monitoring for imaging and measurement systems.
Sensitivity and specificity are not deployment performance by themselves. Predictive values, workload, subgroup behavior and monitoring triggers determine whether a diagnostic claim can operate safely.
Thresholds should be controlled design inputs. Changing a threshold changes intended use, risk balance, workflow and validation evidence.
Common Release Mistakes
- reporting sensitivity and specificity without confidence intervals;
- ignoring intended-use prevalence;
- selecting a threshold from Youden index alone;
- accepting high sensitivity while false-positive workload exceeds capacity;
- ignoring subgroup performance gaps;
- using a weak or inconsistent reference standard;
- releasing from curated data without deployment monitoring;
- treating calibration drift as harmless if average accuracy remains high;
- evaluating latency without queueing or peak-load data;
- updating a model without enough regression evidence.
Validation Package Checklist
- intended-use population stated;
- reference standard and adjudication method defined;
- dataset split and case mix documented;
- locked score and threshold recorded;
- sensitivity, specificity and uncertainty calculated;
- PPV and NPV checked at deployment prevalence;
- false-positive and false-negative workload estimated;
- subgroup performance reviewed;
- reader or reference agreement checked where relevant;
- latency and queue capacity evaluated;
- update and rollback evidence defined;
- post-deployment monitoring triggers approved;
- final decision states normal release, restricted release, hold, retest or monitoring-only use.