Glossary term
Saga Pattern
Engineering definition of the saga pattern covering distributed workflow steps, compensating actions, orchestration, choreography, idempotency, recovery and validation.
Definition
The saga pattern is a distributed transaction pattern that coordinates a sequence of local transactions with compensating actions instead of relying on one global atomic commit.
Saga patterns appear in distributed services, message-driven workflows, industrial platforms, order systems, provisioning workflows and long-running business processes when each participant owns its own data and a global transaction is unavailable or undesirable. A useful design states the workflow steps, local commit boundary, compensation for each completed step, orchestration or choreography rule, idempotency key, timeout behavior, retry policy, recovery log and validation evidence.
A saga is used when each service, actor or subsystem owns its own data and a long-running workflow cannot hold one database transaction open across every participant.
Saga designs appear in distributed services, message-driven systems, provisioning workflows, order fulfillment, industrial platforms and control gateways. They trade atomic rollback for explicit recovery logic. That recovery logic must be designed, tested and observable.
Step Model
Let the saga contain $n$ ordered steps:
$$s_1, s_2, \dots, s_n$$
Each step commits locally. If step $s_i$ succeeds, its effect is durable at that participant. A later failure cannot simply roll back every prior database commit with one global transaction.
Compensation
Each completed step $s_i$ should have a compensating action $c_i$ that reverses, offsets or reconciles the business effect of $s_i$.
Compensation is not always perfect. Refunding a payment, cancelling a shipment, restoring a configuration or marking a command as superseded may leave audit records, latency, inventory movement or operator work behind. The design should state which effects are fully reversible and which require reconciliation.
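The step and compensation pairing can be sketched as a small runner. This is a minimal illustration, not a production coordinator: the names `SagaStep` and `run_saga` are hypothetical, each action stands in for a local transaction that raises on failure, and the sketch assumes compensations themselves never fail.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]      # stands in for a local transaction; raises on failure
    compensate: Callable[[], None]  # offsets the effect of a committed action


def run_saga(steps: List[SagaStep]) -> str:
    """Run steps in order; on failure, compensate committed steps in reverse order."""
    committed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step did not commit, so only earlier steps are compensated.
            for done in reversed(committed):
                done.compensate()
            return "compensated"
        committed.append(step)
    return "completed"


# Illustrative order workflow in which the second step fails.
log: List[str] = []


def declined() -> None:
    raise RuntimeError("payment declined")


steps = [
    SagaStep("reserve", lambda: log.append("reserve"), lambda: log.append("release")),
    SagaStep("charge", declined, lambda: log.append("refund")),
    SagaStep("ship", lambda: log.append("ship"), lambda: log.append("cancel")),
]
outcome = run_saga(steps)
```

Only the reservation is compensated here, because the charge never committed and the shipment never started. A real design would also need to persist progress and handle a compensation that fails.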
Orchestration and Choreography
An orchestrated saga has a coordinator that decides the next step and triggers compensations. A choreographed saga lets participants publish events and react to each other. Orchestration is easier to trace but can create a central workflow dependency. Choreography reduces central coupling but can make ordering, recovery and ownership harder to inspect.
The engineering choice should match the failure mode. A safety-critical or operator-visible workflow often benefits from explicit orchestration and a recovery log. A high-volume event workflow may favor choreography only if observability and idempotency are strong.
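To make the contrast concrete, a choreographed flow can be sketched with an in-memory event bus. Everything here is an assumption for illustration: the `EventBus` class, the event names and the credit-limit rule are hypothetical, and a real system would use a durable broker with delivery guarantees.

```python
from collections import defaultdict, deque
from typing import Callable, DefaultDict, Deque, Dict, List, Tuple

Payload = Dict[str, int]
Handler = Callable[[Payload], None]


class EventBus:
    """In-memory stand-in for a message broker (no durability, single thread)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self._queue: Deque[Tuple[str, Payload]] = deque()
        self.history: List[str] = []  # delivery order, kept for inspection

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: Payload) -> None:
        self._queue.append((event_type, payload))

    def drain(self) -> None:
        """Deliver queued events until no participant publishes anything new."""
        while self._queue:
            event_type, payload = self._queue.popleft()
            self.history.append(event_type)
            for handler in self._handlers[event_type]:
                handler(payload)


def wire_participants(bus: EventBus, credit_limit: int) -> None:
    """Each participant reacts to events; no coordinator decides the next step."""

    def reserve_stock(order: Payload) -> None:
        bus.publish("StockReserved", order)

    def take_payment(order: Payload) -> None:
        if order["amount"] <= credit_limit:
            bus.publish("PaymentCaptured", order)
        else:
            bus.publish("PaymentFailed", order)

    def release_stock(order: Payload) -> None:
        # Compensation is triggered by an event rather than by a coordinator.
        bus.publish("StockReleased", order)

    bus.subscribe("OrderPlaced", reserve_stock)
    bus.subscribe("StockReserved", take_payment)
    bus.subscribe("PaymentFailed", release_stock)


bus = EventBus()
wire_participants(bus, credit_limit=100)
bus.publish("OrderPlaced", {"order_id": 7, "amount": 250})
bus.drain()
```

The workflow exists only as the combination of subscriptions, which is why the `history` list (or its production equivalent, a correlated trace) matters for inspecting ordering and recovery.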
Success Probability
If step $i$ of an $n$-step saga has success probability $p_i$ and failures are treated as independent for a rough planning screen, the probability that all steps complete is:
$$P_{\text{all}} = \prod_{i=1}^{n} p_i$$
This is not a proof of reliability, but it shows why long sagas need careful recovery. Many individually reliable steps can still produce a meaningful compensation rate.
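The planning screen is simple enough to compute directly. The helper below is a hypothetical sketch (the name `all_steps_success` is not from any library) and carries the same independence assumption as the text.

```python
import math
from typing import Sequence


def all_steps_success(step_probabilities: Sequence[float]) -> float:
    """Probability that every step succeeds, assuming independent failures."""
    for p in step_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return math.prod(step_probabilities)


# Steps may have different reliabilities; these values are illustrative only.
estimate = all_steps_success([0.99, 0.97, 0.995])
recovery_rate = 1.0 - estimate
```

Because the result is a product, the least reliable step dominates, and adding steps can only lower the estimate.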
Idempotency and Retries
Saga steps and compensations should be idempotent or protected by an idempotency key. A retry may repeat after a timeout even though the previous attempt eventually committed.
For attempt count $a$ and retry budget $R$, the step should stop retrying when:
$$a \ge R$$
and move to an explicit recovery state instead of creating a retry storm.
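A minimal sketch of both ideas follows. It assumes the retry budget counts total attempts, and the names `IdempotentParticipant`, `call_with_retry_budget` and the key `order-42-charge` are hypothetical. The flaky operation models the ambiguous case in which the participant commits but the response is lost.

```python
from typing import Callable, Dict


class IdempotentParticipant:
    """Applies each idempotency key at most once and replays the stored result."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}
        self.applied_count = 0

    def handle(self, idempotency_key: str, request: str) -> str:
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # duplicate: no second effect
        self.applied_count += 1
        result = f"done:{request}"
        self._results[idempotency_key] = result
        return result


def call_with_retry_budget(operation: Callable[[], str], retry_budget: int) -> str:
    """Retry on timeout until the attempt count reaches the budget."""
    attempts = 0
    while attempts < retry_budget:
        attempts += 1
        try:
            return operation()
        except TimeoutError:
            continue
    return "needs_recovery"  # explicit recovery state instead of endless retries


participant = IdempotentParticipant()
calls = {"n": 0}


def flaky_charge() -> str:
    calls["n"] += 1
    result = participant.handle("order-42-charge", "charge")
    if calls["n"] == 1:
        # The participant committed, but the caller never saw the response.
        raise TimeoutError("response lost after commit")
    return result


charge_result = call_with_retry_budget(flaky_charge, retry_budget=3)
```

The retry reaches the participant twice, yet the effect is applied once because the second call hits the stored result. A production version would add backoff between attempts and persist the key table.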
Timeout and Recovery Log
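A saga needs a durable record of current step, completed effects, pending compensations and terminal state. If the workflow timeout is $T_{\max}$ and $T_{\text{recovery}}$ is the worst-case time from a failure to an acceptable terminal state, then the system should prove:
$$T_{\text{recovery}} \le T_{\max}$$
for the failure cases where a bounded recovery time is required.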
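One way to picture the durable record is an append-only log that is replayed after a coordinator restart. The sketch below is an assumption-laden illustration: the event names (`step_committed`, `step_compensated`, `saga_failed`, `saga_terminal`), the JSON-lines layout and the one-file-per-saga choice are all hypothetical, and `os.fsync` only provides durability as far as the operating system and storage honor it.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def append_record(log_path: Path, record: Dict[str, Any]) -> None:
    """Append one JSON line and force it to disk before returning."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
        handle.flush()
        os.fsync(handle.fileno())


def replay(log_path: Path) -> Dict[str, Any]:
    """Rebuild saga state from the log, for example after a coordinator restart."""
    completed: List[str] = []
    compensated = set()
    failed, terminal = False, None
    for line in log_path.read_text(encoding="utf-8").splitlines():
        record = json.loads(line)
        event = record["event"]
        if event == "step_committed":
            completed.append(record["step"])
        elif event == "step_compensated":
            compensated.add(record["step"])
        elif event == "saga_failed":
            failed = True
        elif event == "saga_terminal":
            terminal = record["state"]
    pending: List[str] = []
    if failed and terminal is None:
        # Compensations still owed, newest committed step first.
        pending = [s for s in reversed(completed) if s not in compensated]
    return {"completed": completed, "pending_compensations": pending, "terminal": terminal}


# Simulate a crash after the first compensation was recorded.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "saga-42.log"
    for rec in [
        {"event": "step_committed", "step": "reserve"},
        {"event": "step_committed", "step": "charge"},
        {"event": "saga_failed", "reason": "ship rejected"},
        {"event": "step_compensated", "step": "charge"},
    ]:
        append_record(path, rec)
    recovered = replay(path)
```

After replay the coordinator knows that the reservation still needs compensating and that no terminal state has been reached, which is the information a bounded-recovery argument depends on.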
Failure Modes
Common failure modes include missing compensation, compensation that is not idempotent, coordinator state loss, duplicate event handling, ambiguous timeout outcome, out-of-order events, permanent partial completion, retry storms, poison messages, manual repair with no audit trail and monitoring that reports only started workflows rather than stuck or compensating workflows.
The most common mistake is to call any chain of service calls a saga. A defensible saga has explicit local commit boundaries, compensation logic, recovery state and tests for failures after every step.
When It Does Not Fit
A saga is a weak fit when the workflow requires immediate atomic visibility, when compensation is unsafe, or when partial completion would violate a safety or regulatory boundary. In those cases the system may need a smaller transaction boundary, stronger consistency, manual approval or a redesigned command model.
Worked Check
Suppose a workflow has five local steps, each with estimated success probability $p = 0.97$.
The rough all-steps success probability is:
$$P_{\text{all}} = 0.97^5 \approx 0.8587$$
So the compensation or recovery path may be exercised by approximately:
$$1 - P_{\text{all}} \approx 0.1413$$
of attempted workflows under those assumptions. If the business process cannot tolerate a 14.13 percent recovery path, the design needs fewer steps, more reliable dependencies, stronger admission control, prevalidation or a different transaction boundary.
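The same arithmetic can be turned around to ask how many steps a given recovery-rate tolerance allows. This is a rough sketch under the same independence assumption, with a hypothetical helper name and illustrative tolerance values.

```python
def max_steps_for_recovery_budget(p: float, max_recovery_rate: float) -> int:
    """Largest step count n with 1 - p**n <= max_recovery_rate."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 <= max_recovery_rate < 1.0:
        raise ValueError("max_recovery_rate must be in [0, 1)")
    n = 0
    # Add steps while the all-steps success probability stays above the floor.
    while p ** (n + 1) >= 1.0 - max_recovery_rate:
        n += 1
    return n


# Reproduce the worked check, then ask the inverse question.
recovery_rate_five_steps = 1.0 - 0.97 ** 5
steps_at_ten_percent = max_steps_for_recovery_budget(0.97, 0.10)
steps_at_five_percent = max_steps_for_recovery_budget(0.97, 0.05)
```

With steps at 0.97 each, a 10 percent tolerance allows three steps and a 5 percent tolerance allows only one, which shows how quickly a per-step reliability target tightens as the saga grows.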
Validation Evidence
Useful evidence includes workflow-state traces, step-level success and failure rates, compensation tests after every committed step, idempotency-key tests, retry-budget tests, timeout ambiguity tests, duplicate-event tests, recovery-log replay, dead-letter handling, operator repair drills and dashboards showing active, stuck, compensating and terminal workflows.
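Compensation tests after every committed step can be automated with a small failure-injection harness. The sketch below is illustrative only: the state model, step names and helper functions are hypothetical, and checking for exact equality with the initial state is a simplification that fits only fully reversible effects.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Mutation = Callable[[State], None]
Step = Tuple[Mutation, Mutation]  # (action, compensation)


def adjust(key: str, delta: int) -> Mutation:
    """Build a mutation that adds `delta` to one field of the state."""
    def apply(state: State) -> None:
        state[key] += delta
    return apply


def run_with_failure_at(steps: List[Step], state: State, fail_index: int) -> None:
    """Commit steps before `fail_index`, then fail and compensate in reverse."""
    undo_stack: List[Mutation] = []
    for index, (action, compensation) in enumerate(steps):
        if index == fail_index:
            for undo in reversed(undo_stack):
                undo(state)
            return
        action(state)
        undo_stack.append(compensation)


def failing_injection_points(steps: List[Step], initial: State) -> List[int]:
    """Return the failure indexes where compensation did not restore `initial`."""
    broken: List[int] = []
    for fail_index in range(len(steps)):
        state = dict(initial)
        run_with_failure_at(steps, state, fail_index)
        if state != initial:
            broken.append(fail_index)
    return broken


initial = {"stock": 10, "balance": 100, "shipments": 0}
good_steps = [
    (adjust("stock", -1), adjust("stock", +1)),
    (adjust("balance", -30), adjust("balance", +30)),
    (adjust("shipments", +1), adjust("shipments", -1)),
]
# Deliberate bug: the refund returns less than was charged.
buggy_steps = [good_steps[0], (adjust("balance", -30), adjust("balance", +20)), good_steps[2]]

good_report = failing_injection_points(good_steps, initial)
buggy_report = failing_injection_points(buggy_steps, initial)
```

The buggy refund is caught only when the failure is injected after the charge has committed, which is why the harness iterates over every injection point instead of testing a single failure.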
A strong saga review states what consistency is guaranteed, what inconsistency is temporary, what inconsistency requires manual repair and how the system proves that every partially completed workflow eventually reaches an acceptable terminal state.