Saga Pattern

Engineering definition of the saga pattern covering distributed workflow steps, compensating actions, orchestration, choreography, idempotency, recovery and validation.

Definition

The saga pattern is a distributed transaction pattern that coordinates a sequence of local transactions with compensating actions instead of relying on one global atomic commit.

Saga patterns appear in distributed services, message-driven workflows, industrial platforms, order systems, provisioning workflows and long-running business processes when each participant owns its own data and a global transaction is unavailable or undesirable. A useful design states the workflow steps, local commit boundary, compensation for each completed step, orchestration or choreography rule, idempotency key, timeout behavior, retry policy, recovery log and validation evidence.

The pattern is used when each service, actor or subsystem owns its own data and a long-running workflow cannot hold one database transaction open across every participant.

Saga designs appear in distributed services, message-driven systems, provisioning workflows, order fulfillment, industrial platforms and control gateways. They trade atomic rollback for explicit recovery logic. That recovery logic must be designed, tested and observable.

Step Model

Let the saga contain ordered steps:

S_1,S_2,\ldots,S_n

Each step commits locally. If step:

S_i

succeeds, its effect is durable at that participant. A later failure cannot simply roll back every prior database commit with one global transaction.

Compensation

Each completed step should have a compensating action:

C_i

that reverses, offsets or reconciles the business effect of:

S_i

Compensation is not always perfect. Refunding a payment, cancelling a shipment, restoring a configuration or marking a command as superseded may leave audit records, latency, inventory movement or operator work behind. The design should state which effects are fully reversible and which require reconciliation.
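The step and compensation model above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names Step, run_saga and the order-style steps are hypothetical, and it assumes that a step which raises has not committed and that compensations themselves do not fail.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]       # S_i: commits locally or raises
    compensate: Callable[[], None]   # C_i: offsets the business effect of S_i


def run_saga(steps: List[Step]) -> str:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            # Assumption: the failed step did not commit, so only the
            # earlier, completed steps need compensation.
            for done in reversed(completed):
                done.compensate()
            return "compensated"
    return "completed"


log: List[str] = []


def fail_shipping() -> None:
    raise RuntimeError("carrier unavailable")


steps = [
    Step("reserve", lambda: log.append("reserve"), lambda: log.append("release")),
    Step("charge", lambda: log.append("charge"), lambda: log.append("refund")),
    Step("ship", fail_shipping, lambda: log.append("cancel_shipment")),
]
outcome = run_saga(steps)
print(outcome, log)
```

Compensations run in reverse order of completion, so the refund is issued before the reservation is released. A production design would also persist progress and handle a compensation that fails.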

Orchestration and Choreography

An orchestrated saga has a coordinator that decides the next step and triggers compensations. A choreographed saga lets participants publish events and react to each other. Orchestration is easier to trace but can create a central workflow dependency. Choreography reduces central coupling but can make ordering, recovery and ownership harder to inspect.

The engineering choice should match the failure mode. A safety-critical or operator-visible workflow often benefits from explicit orchestration and a recovery log. A high-volume event workflow may favor choreography only if observability and idempotency are strong.
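For contrast with a coordinator-driven design, the following sketch shows choreography, where no component owns the whole workflow. EventBus, the event names and the handlers are hypothetical, and the bus is a synchronous in-memory stand-in for a real message broker.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    """In-memory stand-in for a message broker (synchronous, single process)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []   # published event names, kept for tracing

    def subscribe(self, event: str, handler: Handler) -> None:
        self._handlers[event].append(handler)

    def publish(self, event: str, payload: Dict) -> None:
        self.history.append(event)
        for handler in list(self._handlers[event]):
            handler(payload)


bus = EventBus()
stock = {"widget": 0}   # no stock, so the reservation is rejected


def on_order_placed(order: Dict) -> None:
    # Inventory participant: commits locally or reports the rejection.
    if stock[order["item"]] >= order["qty"]:
        stock[order["item"]] -= order["qty"]
        bus.publish("stock_reserved", order)
    else:
        bus.publish("stock_rejected", order)


def on_stock_rejected(order: Dict) -> None:
    # Order participant: compensates its own earlier commit.
    order["status"] = "cancelled"
    bus.publish("order_cancelled", order)


bus.subscribe("order_placed", on_order_placed)
bus.subscribe("stock_rejected", on_stock_rejected)

order = {"id": "o-1", "item": "widget", "qty": 1, "status": "placed"}
bus.publish("order_placed", order)
print(bus.history, order["status"])
```

The workflow order exists only implicitly in the subscriptions, which is why the event history is recorded here: without such a trace, recovery and ownership become hard to inspect.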

Success Probability

If each step has success probability:

P_i

and failures are treated as independent for a rough planning screen, the probability that all steps complete is:

P_{saga}\approx \prod_{i=1}^{n}P_i

This is not a proof of reliability, but it shows why long sagas need careful recovery. Many individually reliable steps can still produce a meaningful compensation rate.
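The product formula is easy to evaluate for steps with different reliabilities. The sketch below assumes independent failures, as the formula does, and the per-step estimates are hypothetical values chosen only for illustration.

```python
import math
from typing import Sequence


def saga_success_probability(step_probabilities: Sequence[float]) -> float:
    """Product of per-step success probabilities, assuming independence."""
    for p in step_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return math.prod(step_probabilities)


# Hypothetical per-step estimates for a four-step workflow.
estimates = [0.999, 0.995, 0.98, 0.99]
p_all = saga_success_probability(estimates)
recovery_rate = 1.0 - p_all
print(f"all steps succeed: {p_all:.4f}, recovery path: {recovery_rate:.4f}")
```

With these estimates the least reliable step dominates the result, which suggests where prevalidation or a more reliable dependency would help most.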

Idempotency and Retries

Saga steps and compensations should be idempotent or protected by an idempotency key. A caller may retry after a timeout even though the previous attempt eventually committed, so without duplicate detection the same effect can be applied twice.

For attempt count:

R_i

and retry budget:

R_{max}

the step should stop retrying when:

R_i>R_{max}

and move to an explicit recovery state instead of creating a retry storm.
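The sketch below combines an idempotency key with a retry budget. It reads R_i as the number of attempts made so far, so a budget of R_max allows one initial attempt plus R_max retries; that reading is an assumption, as are the names TransientError, apply_once and run_step. Backoff between attempts is omitted for brevity.

```python
from typing import Callable, Dict, List


class TransientError(Exception):
    """Hypothetical error type for timeouts and other retryable failures."""


_results: Dict[str, str] = {}   # idempotency key -> stored result


def apply_once(key: str, effect: Callable[[], str]) -> str:
    """Participant side: apply the effect at most once per idempotency key."""
    if key not in _results:
        _results[key] = effect()
    return _results[key]


def run_step(attempt_fn: Callable[[], str], r_max: int) -> str:
    """Caller side: retry until the attempt count exceeds the retry budget."""
    attempts = 0                      # R_i
    while attempts <= r_max:          # stop once R_i > R_max
        attempts += 1
        try:
            return attempt_fn()
        except TransientError:
            continue
    return "needs_recovery"           # explicit recovery state, not another retry


charges: List[int] = []
calls = {"n": 0}


def charge() -> str:
    charges.append(100)
    return "charged"


def charge_with_lost_reply() -> str:
    # The first call commits, but the reply is lost and the caller sees a timeout.
    calls["n"] += 1
    result = apply_once("order-42:charge", charge)
    if calls["n"] == 1:
        raise TransientError("reply lost after commit")
    return result


def always_times_out() -> str:
    raise TransientError("timeout")


status = run_step(charge_with_lost_reply, r_max=3)
exhausted = run_step(always_times_out, r_max=3)
print(status, charges, exhausted)
```

The retried call returns the stored result instead of charging again, and the step that never succeeds ends in a named recovery state after the budget is spent.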

Timeout and Recovery Log

A saga needs a durable record of current step, completed effects, pending compensations and terminal state. If the workflow timeout is:

T_w

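then, with T_{detect} the time to detect a failure and T_{recover} the time to reach an acceptable terminal state after detection, the system should demonstrate: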

T_{detect}+T_{recover}\leq T_w

for the failure cases where a bounded recovery time is required.
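One possible shape for the durable record and the timing check is sketched below. SagaRecord, its field names and the state labels are hypothetical, and JSON serialization stands in for whatever durable store the system actually uses.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SagaRecord:
    saga_id: str
    current_step: str
    completed: List[str] = field(default_factory=list)
    pending_compensations: List[str] = field(default_factory=list)
    state: str = "running"   # e.g. running, compensating, completed, compensated


def serialize(record: SagaRecord) -> str:
    """One JSON line per state change; a real log would append it to durable storage."""
    return json.dumps(asdict(record), sort_keys=True)


def recovery_fits(t_detect: float, t_recover: float, t_w: float) -> bool:
    """Check T_detect + T_recover <= T_w, with all values in the same time unit."""
    return t_detect + t_recover <= t_w


record = SagaRecord(
    "saga-7",
    current_step="ship",
    completed=["reserve", "charge"],
    pending_compensations=["refund", "release"],
    state="compensating",
)
line = serialize(record)
restored = SagaRecord(**json.loads(line))   # what a restarted coordinator would load

print(restored == record)
print(recovery_fits(t_detect=5.0, t_recover=20.0, t_w=30.0))
print(recovery_fits(t_detect=15.0, t_recover=20.0, t_w=30.0))
```

Because the record lists pending compensations explicitly, a coordinator that restarts mid-recovery can resume from the log rather than guess which effects still need to be undone.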

Failure Modes

Common failure modes include missing compensation, compensation that is not idempotent, coordinator state loss, duplicate event handling, ambiguous timeout outcome, out-of-order events, permanent partial completion, retry storms, poison messages, manual repair with no audit trail and monitoring that reports only started workflows rather than stuck or compensating workflows.

The most common mistake is to call any chain of service calls a saga. A defensible saga has explicit local commit boundaries, compensation logic, recovery state and tests for failures after every step.
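Testing a failure at every step can be automated with simple fault injection. The sketch below includes a minimal runner so that it stands alone; build_steps and the counter-based steps are hypothetical, and each injected failure is assumed to occur before the step commits.

```python
from typing import Callable, Dict, List, Tuple

StepPair = Tuple[Callable[[], None], Callable[[], None]]  # (action, compensation)


def run_saga(steps: List[StepPair]) -> str:
    done: List[StepPair] = []
    for action, compensate in steps:
        try:
            action()
            done.append((action, compensate))
        except Exception:
            for _, undo in reversed(done):
                undo()
            return "compensated"
    return "completed"


def build_steps(state: Dict[int, int], fail_at: int, n: int = 3) -> List[StepPair]:
    """Build n counter-based steps; the step at index fail_at raises before committing."""
    def make(i: int) -> StepPair:
        def action() -> None:
            if i == fail_at:
                raise RuntimeError(f"injected failure at step {i}")
            state[i] = state.get(i, 0) + 1

        def compensate() -> None:
            state[i] -= 1

        return action, compensate

    return [make(i) for i in range(n)]


results = []
for fail_at in range(3):
    state: Dict[int, int] = {}
    outcome = run_saga(build_steps(state, fail_at))
    # After compensation, every committed effect should net to zero.
    results.append((fail_at, outcome, all(v == 0 for v in state.values())))

happy_path = run_saga(build_steps({}, fail_at=-1))   # no injected failure
print(results, happy_path)
```

A fuller test suite would also inject failures after a step commits but before its acknowledgement arrives, since that is where ambiguous timeout outcomes come from.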

When It Does Not Fit

A saga is a weak fit when the workflow requires immediate atomic visibility, when compensation is unsafe, or when partial completion would violate a safety or regulatory boundary. In those cases the system may need a smaller transaction boundary, stronger consistency, manual approval or a redesigned command model.

Worked Check

Suppose a workflow has five local steps, each with estimated success probability:

P_i=0.97

The rough all-steps success probability is:

P_{saga}\approx 0.97^5\approx 0.8587

So the compensation or recovery path may be exercised by approximately:

1-0.8587=0.1413

of attempted workflows under those assumptions. If the business process cannot tolerate roughly 14 percent of workflows entering the recovery path, the design needs fewer steps, more reliable dependencies, stronger admission control, prevalidation or a different transaction boundary.
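The same arithmetic can be inverted to size a design against a target. The sketch below assumes equal, independent steps; the 5 percent recovery target and the function names are hypothetical, chosen only to illustrate the calculation.

```python
import math


def max_steps(p_step: float, max_recovery_rate: float) -> int:
    """Largest n with 1 - p_step**n <= max_recovery_rate, for equal independent steps."""
    if not (0.0 < p_step < 1.0 and 0.0 < max_recovery_rate < 1.0):
        raise ValueError("both arguments must lie strictly between 0 and 1")
    return math.floor(math.log(1.0 - max_recovery_rate) / math.log(p_step))


def required_step_probability(n: int, max_recovery_rate: float) -> float:
    """Per-step success probability needed so n equal steps stay within the target."""
    return (1.0 - max_recovery_rate) ** (1.0 / n)


p_saga = 0.97 ** 5                              # about 0.8587, as in the worked check
steps_for_5pct = max_steps(0.97, 0.05)          # steps allowed at a 5 percent target
p_needed = required_step_probability(5, 0.05)   # per-step reliability for five steps

print(f"{p_saga:.4f} {steps_for_5pct} {p_needed:.4f}")
```

Under these assumptions a 5 percent target allows only one step at 0.97 reliability, or requires each of five steps to succeed about 98.98 percent of the time, which shows how quickly the options narrow.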

Validation Evidence

Useful evidence includes workflow-state traces, step-level success and failure rates, compensation tests after every committed step, idempotency-key tests, retry-budget tests, timeout ambiguity tests, duplicate-event tests, recovery-log replay, dead-letter handling, operator repair drills and dashboards showing active, stuck, compensating and terminal workflows.
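A dashboard that separates active, stuck, compensating and terminal workflows needs a classification rule. One simple rule is sketched below; the Workflow fields, the state names and the 30-second staleness threshold are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Workflow:
    saga_id: str
    state: str            # "running", "compensating", "completed" or "compensated"
    last_update_s: float  # seconds since an arbitrary epoch


TERMINAL = {"completed", "compensated"}


def classify(wf: Workflow, now_s: float, stuck_after_s: float) -> str:
    if wf.state in TERMINAL:
        return "terminal"
    if now_s - wf.last_update_s > stuck_after_s:
        return "stuck"   # non-terminal and no progress within the threshold
    return "compensating" if wf.state == "compensating" else "active"


def summarize(workflows: Iterable[Workflow], now_s: float, stuck_after_s: float) -> Counter:
    return Counter(classify(wf, now_s, stuck_after_s) for wf in workflows)


sample = [
    Workflow("a", "running", 95.0),
    Workflow("b", "running", 10.0),
    Workflow("c", "compensating", 90.0),
    Workflow("d", "completed", 5.0),
]
counts = summarize(sample, now_s=100.0, stuck_after_s=30.0)
print(dict(counts))
```

Counting stuck workflows separately from active ones is what distinguishes this view from monitoring that reports only started workflows.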

A strong saga review states what consistency is guaranteed, what inconsistency is temporary, what inconsistency requires manual repair and how the system proves that every partially completed workflow eventually reaches an acceptable terminal state.
