Change Data Capture

Engineering definition of change data capture covering database log capture, offsets, lag, replay, schema changes, retention risk and validation.

Definition

Change data capture is a data-integration pattern that detects committed changes in a source system and delivers them to downstream consumers as change events or replicated records.

Change data capture appears in distributed services, analytics pipelines, telemetry platforms, integration layers and operational replicas when downstream systems need database changes without polling full tables. A useful design states the source log or trigger mechanism, capture offset, ordering rule, schema-change behavior, replay contract, idempotency key, delivery target, lag metric, retention limit and validation evidence.

CDC is also common in audit systems. It can reduce polling load and improve freshness, but it introduces capture offsets, lag, schema-change handling, replay behavior and downstream idempotency requirements that the design has to account for.

Capture Source

Let the source commit log position be:

L_{src}

and the captured position be:

L_{cap}

The capture system is current only when:

L_{cap}=L_{src}

In practice, capture is usually behind. The engineering question is whether the lag is bounded and acceptable for the downstream use case.

Capture Lag

If the source log grows at:

\lambda_{log}

and the capture connector reads and publishes at:

\mu_{cap}

then byte lag grows when:

\lambda_{log}>\mu_{cap}

The backlog growth rate is:

g=\lambda_{log}-\mu_{cap}

This lag affects data age, alerting, replication freshness and any workflow that assumes downstream systems have seen recent writes.
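The relationship between the two rates can be sketched in a few lines of Python. This is a minimal illustration that assumes both rates stay constant over the projection window; the function names are hypothetical and not taken from any CDC tool.

```python
def backlog_growth_rate(log_rate: float, capture_rate: float) -> float:
    """Return g = lambda_log - mu_cap, in the same units as the inputs."""
    return log_rate - capture_rate


def projected_lag(current_lag: float, log_rate: float,
                  capture_rate: float, minutes: float) -> float:
    """Project byte lag after `minutes`, assuming constant rates.

    Lag is clamped at zero: once the connector has caught up, it can
    only read as fast as the source writes.
    """
    g = backlog_growth_rate(log_rate, capture_rate)
    return max(0.0, current_lag + g * minutes)
```

A negative growth rate means the connector is draining its backlog, which is why the projection is clamped instead of going below zero.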

Offsets and Replay

CDC systems need a durable offset. If the connector crashes after publishing but before committing its offset, it replays events on restart (at-least-once delivery). If it commits the offset before the publish is durable, it can lose events (at-most-once delivery). The design should state which of these two failure modes is allowed and how consumers handle duplicates.
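One common choice is to publish first and commit the offset second, which permits duplicates but not loss. The loop below is a minimal sketch of that ordering. The `publish` and `commit_offset` callables and the `(position, payload)` tuple shape are assumptions made for illustration, not the interface of any particular connector.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical shape: (log position, change payload).
Change = Tuple[int, dict]


def run_capture(
    changes: Iterable[Change],
    publish: Callable[[dict], None],
    commit_offset: Callable[[int], None],
) -> None:
    """Publish each change, then persist its offset.

    If the process fails between the two calls, the offset still points
    at the previous change, so a restart publishes this change again.
    """
    for position, payload in changes:
        publish(payload)         # may raise; the offset is then untouched
        commit_offset(position)  # reached only after a successful publish
```

Swapping the two calls inside the loop would turn the same sketch into the lossy variant, since a crash after the commit would skip a change that was never published.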

For event identifier:

ID_e

a consumer should apply side effects only when:

ID_e\notin S_{seen}

or otherwise prove the operation is idempotent.
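The membership check above might look like the following on the consumer side. This is a sketch under stated assumptions: events are dictionaries with an `id` key, and the seen set lives in memory. A real consumer would need the seen set to be durable and updated atomically with the side effect, which this example does not attempt.

```python
from typing import Callable, Set


class IdempotentConsumer:
    """Apply a side effect at most once per event identifier."""

    def __init__(self, apply: Callable[[dict], None]) -> None:
        self._apply = apply
        self._seen: Set[str] = set()  # S_seen; in-memory only in this sketch

    def handle(self, event: dict) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        event_id = event["id"]  # hypothetical field carrying ID_e
        if event_id in self._seen:
            return False
        self._apply(event)
        self._seen.add(event_id)
        return True
```

The identifier is recorded only after the side effect succeeds, so a failure inside `apply` leaves the event eligible for a retry.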

Ordering and Transactions

A CDC stream may preserve source commit order, table order, partition order or only connector-specific order. Multi-row transactions should be represented so consumers know whether they are seeing partial transaction state or a complete committed unit.

If transaction:

X_k

contains changes:

c_1,c_2,\ldots,c_m

the downstream contract should state whether consumers may observe:

c_j

before all changes in:

X_k

are available.
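If the contract forbids partial visibility, a consumer can buffer changes until the whole transaction has arrived. The sketch below assumes each change carries a transaction identifier and the total number of changes in that transaction; the `tx_id` and `tx_size` field names are hypothetical, and real CDC tools expose transaction boundaries in their own ways.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class TransactionBuffer:
    """Hold changes back until every change of their transaction is present."""

    def __init__(self) -> None:
        self._pending: Dict[str, List[dict]] = defaultdict(list)

    def add(self, change: dict) -> Optional[List[dict]]:
        """Return the complete transaction once its last change arrives, else None."""
        tx_id = change["tx_id"]
        self._pending[tx_id].append(change)
        if len(self._pending[tx_id]) == change["tx_size"]:
            return self._pending.pop(tx_id)
        return None
```

Buffering trades latency and memory for atomic visibility, so the maximum transaction size the consumer will tolerate belongs in the same contract.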

Schema Changes

Schema changes are operational events, not only database administration. Adding a nullable column may be harmless. Renaming a field, changing a type or altering primary keys can break consumers, replay tooling and deduplication. The CDC contract should include schema versioning and compatibility rules.

The safest design treats schema metadata as part of the event stream or deploys producers and consumers through a compatibility window.
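A compatibility rule can be made checkable before a schema change ships. The following sketch compares two schema versions and reports changes that this example treats as breaking. The schema representation (field name mapped to a type name and a nullable flag) and the rule set are assumptions for illustration; a rename shows up as a removed field, and nullability changes on existing fields are not checked.

```python
from typing import Dict, List, Tuple

Field = Tuple[str, bool]   # (type name, nullable); hypothetical representation
Schema = Dict[str, Field]


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """List changes that would break a consumer still reading the old schema."""
    problems: List[str] = []
    for name, (old_type, _) in old.items():
        if name not in new:
            problems.append(f"removed or renamed field: {name}")
        elif new[name][0] != old_type:
            problems.append(f"type changed for {name}: {old_type} -> {new[name][0]}")
    for name, (_, nullable) in new.items():
        # A new nullable field is treated as compatible in this sketch.
        if name not in old and not nullable:
            problems.append(f"new required field: {name}")
    return problems
```

An empty result means the change passes these particular rules, not that every consumer is safe; primary-key changes, for example, would need their own check.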

Retention Risk

Source logs are retained for a finite period or size. If lag grows past retention, the connector may no longer be able to resume from its stored offset.

Let retained log capacity be:

B_{ret}

and current lag be:

B_{lag}

When lag grows at rate:

g

time to retention exhaustion is:

\displaystyle T_{ret}=\frac{B_{ret}-B_{lag}}{g}

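This is a reliability deadline for the connector. It applies only while g is positive; if capture keeps pace with the source or is catching up, the backlog is not growing and there is no deadline.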
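The formula for T_ret is easy to turn into a function that an alerting job could call. This is a minimal sketch with a hypothetical function name; it assumes a constant growth rate and adds two edge cases the formula leaves implicit: lag that already exceeds retention, and a backlog that is not growing.

```python
import math


def time_to_retention_exhaustion(retained: float, lag: float,
                                 growth_rate: float) -> float:
    """Return T_ret = (B_ret - B_lag) / g, in the time unit of growth_rate.

    Returns 0.0 if lag has already reached retained capacity, and
    math.inf if the backlog is not growing.
    """
    if lag >= retained:
        return 0.0
    if growth_rate <= 0:
        return math.inf
    return (retained - lag) / growth_rate
```

Returning infinity for a non-growing backlog keeps comparisons such as "alert when the result is below 60" valid without special-casing.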

Failure Modes

Common failure modes include connector lag, offset loss, duplicate events, missed events, schema incompatibility, poison records, log retention exhaustion, partition reordering, consumer idempotency gaps, replay tools that cannot filter safely and dashboards that show connector up while oldest event age is unacceptable.

CDC should not be used as a hidden dependency. If downstream control, reporting or customer-visible state depends on CDC freshness, lag must be part of the service health model.
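Making lag part of the health model can be as simple as refusing to report healthy on process status alone. The sketch below combines a running flag with the age of the oldest unprocessed event; the `CdcStatus` type and the threshold parameter are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class CdcStatus:
    connector_running: bool
    oldest_event_age_s: float  # age of the oldest unprocessed event, in seconds


def cdc_healthy(status: CdcStatus, max_age_s: float) -> bool:
    """Healthy only if the connector is up and freshness is within budget."""
    return status.connector_running and status.oldest_event_age_s <= max_age_s
```

With this check, a connector that is running but fifteen minutes behind a five-minute freshness budget reports unhealthy, which is the case a status-only dashboard hides.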

When It Does Not Fit

CDC is a weak fit when downstream commands require immediate confirmation, when source schemas change without coordination, or when the source system cannot retain logs long enough for outages. It is also risky when consumers need business intent that is not visible in row-level changes. In those cases a transactional outbox, explicit domain event or synchronous API boundary may be easier to validate.

Worked Check

Suppose the source log grows at:

\lambda_{log}=80\ \text{MB/min}

and capture throughput is:

\mu_{cap}=60\ \text{MB/min}

The lag growth rate is:

g=80-60=20\ \text{MB/min}

If retained log capacity is:

B_{ret}=2400\ \text{MB}

and current lag is:

B_{lag}=600\ \text{MB}

then:

\displaystyle T_{ret}=\frac{2400-600}{20}=90\ \text{min}

If both rates hold, the team has about 90 minutes to restore capture throughput before the stored offset falls outside the retained log and resume-from-offset becomes unsafe.

Validation Evidence

Useful evidence includes source log position, captured offset, oldest unprocessed event age, capture throughput, publish throughput, duplicate-event tests, replay tests, schema-change tests, retention-margin alerts, consumer idempotency tests and recovery drills after connector restart.

A strong CDC review states what changes are captured, what changes are ignored, how ordering is represented, how replay is performed and how lag maps to data freshness or operational risk.
