Glossary term
Change Data Capture
Engineering definition of change data capture covering database log capture, offsets, lag, replay, schema changes, retention risk and validation.
Definition
Change data capture (CDC) is a data-integration pattern that detects committed changes in a source system and delivers them to downstream consumers as change events or replicated records. It is used when downstream systems need a stream of database changes without repeatedly scanning full tables.
CDC appears in distributed services, analytics pipelines, telemetry platforms, operational replicas, audit systems and integration layers. It can reduce polling load and improve freshness, but it introduces capture offsets, lag, schema-change handling, replay behavior and downstream idempotency requirements.
A useful design states the source log or trigger mechanism, capture offset, ordering rule, schema-change behavior, replay contract, idempotency key, delivery target, lag metric, retention limit and validation evidence.
Capture Source
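Let the source commit log position be:
$$P_{\text{source}}(t)$$
and the captured position be:
$$P_{\text{capture}}(t)$$
The capture system is current only when:
$$P_{\text{capture}}(t) = P_{\text{source}}(t)$$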
In practice, capture is usually behind. The engineering question is whether the lag is bounded and acceptable for the downstream use case.
Capture Lag
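If the source log grows at:
$$r_{\text{source}} \quad \text{(bytes per second)}$$
and the capture connector reads and publishes at:
$$r_{\text{capture}} \quad \text{(bytes per second)}$$
then the byte lag $B$ grows when:
$$r_{\text{source}} > r_{\text{capture}}$$
The backlog growth rate is:
$$\frac{dB}{dt} = r_{\text{source}} - r_{\text{capture}}$$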
This lag affects data age, alerting, replication freshness and any workflow that assumes downstream systems have seen recent writes.
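As a minimal sketch of how the backlog growth rate might be measured, the Python below estimates it from two position samples. The `LagSample` shape and its field names are hypothetical, and positions are assumed to be comparable byte offsets, which not every source log exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LagSample:
    """One observation of source and capture positions (hypothetical shape)."""
    timestamp_s: float    # wall-clock time of the sample, in seconds
    source_bytes: int     # source log position, as a byte offset
    captured_bytes: int   # position the connector has read and published

    @property
    def lag_bytes(self) -> int:
        """Byte lag: how far the connector is behind the source log."""
        return self.source_bytes - self.captured_bytes


def backlog_growth_rate(earlier: LagSample, later: LagSample) -> float:
    """Return the change in byte lag per second between two samples.

    A positive result means the source writes faster than the connector
    publishes; a negative result means the connector is catching up.
    """
    elapsed = later.timestamp_s - earlier.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (later.lag_bytes - earlier.lag_bytes) / elapsed
```

Measuring the rate from observed lag rather than from separately reported throughputs keeps the estimate tied to the quantity that actually matters downstream.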
Offsets and Replay
CDC systems need a durable offset. If the connector crashes after publishing but before committing its offset, it will republish those events on restart (at-least-once delivery). If it commits the offset before the publish is durable, it may lose events (at-most-once delivery). The design should state which of the two outcomes is allowed and how consumers handle duplicates.
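For event identifier:
$$e$$
a consumer should apply side effects only when:
$$e \notin S_{\text{applied}}$$
where $S_{\text{applied}}$ is the set of event identifiers the consumer has already applied, or otherwise prove the operation is idempotent.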
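The duplicate-handling rule above can be sketched as a small consumer wrapper. This is an illustration under stated assumptions, not a reference implementation: `ChangeEvent`, `event_id` and `IdempotentConsumer` are hypothetical names, and the in-memory set stands in for what would need to be durable storage.

```python
from typing import Callable, NamedTuple


class ChangeEvent(NamedTuple):
    event_id: str   # hypothetical stable identifier, e.g. log position plus row key
    payload: dict


class IdempotentConsumer:
    """Runs each event's side effect at most once, keyed by event_id."""

    def __init__(self, apply: Callable[[ChangeEvent], None]) -> None:
        self._apply = apply
        # Assumption: an in-memory set; a real consumer would persist this.
        self._applied: set[str] = set()

    def handle(self, event: ChangeEvent) -> bool:
        """Return True if the side effect ran, False if the event was a duplicate."""
        if event.event_id in self._applied:
            return False
        self._apply(event)
        self._applied.add(event.event_id)
        return True
```

The sketch records the identifier after the side effect runs, so a crash between the two steps can still repeat the side effect. Closing that gap usually means writing the side effect and the identifier in one transaction.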
Ordering and Transactions
A CDC stream may preserve source commit order, table order, partition order or only connector-specific order. Multi-row transactions should be represented so consumers know whether they are seeing partial transaction state or a complete committed unit.
If transaction:
$$T$$
contains changes:
$$\{c_1, c_2, \dots, c_n\}$$
the downstream contract should state whether consumers may observe:
$$c_i$$
before all changes in:
$$T$$
are available.
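One way a consumer can avoid exposing partial transaction state is to buffer changes until the transaction is complete. The sketch below assumes each event carries a transaction identifier and a marker on its final change; both fields (`tx_id`, `last_in_tx`) are hypothetical, and real connectors signal transaction boundaries in different ways.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class TxChange(NamedTuple):
    tx_id: str         # hypothetical transaction identifier on each event
    change: dict       # row-level change payload
    last_in_tx: bool   # hypothetical marker set on the transaction's final change


class TransactionBuffer:
    """Holds changes back until the whole transaction has arrived."""

    def __init__(self) -> None:
        self._pending: dict[str, list[dict]] = defaultdict(list)

    def add(self, event: TxChange) -> Optional[list[dict]]:
        """Return all changes of a transaction once it closes, else None."""
        self._pending[event.tx_id].append(event.change)
        if event.last_in_tx:
            # Release the complete unit and forget the transaction.
            return self._pending.pop(event.tx_id)
        return None
```

Buffering trades freshness and memory for atomic visibility, so the contract should also bound how large or long-running a buffered transaction may be.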
Schema Changes
Schema changes are operational events, not only database administration. Adding a nullable column may be harmless. Renaming a field, changing a type or altering primary keys can break consumers, replay tooling and deduplication. The CDC contract should include schema versioning and compatibility rules.
The safest design treats schema metadata as part of the event stream or deploys producers and consumers through a compatibility window.
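A compatibility rule of this kind can be checked mechanically before a schema change ships. The sketch below encodes one possible rule set, taken from the examples above: existing fields must keep their name and type, and new fields must be nullable. The `Field` type and the rule set itself are assumptions for illustration; a real contract may be stricter or looser.

```python
from typing import NamedTuple


class Field(NamedTuple):
    type_name: str
    nullable: bool


def incompatibilities(old: dict[str, Field], new: dict[str, Field]) -> list[str]:
    """List the ways `new` breaks consumers that were written against `old`."""
    problems = []
    for name, field in old.items():
        if name not in new:
            # A rename looks like a removal plus an addition at this level.
            problems.append(f"field removed or renamed: {name}")
        elif new[name].type_name != field.type_name:
            problems.append(
                f"type changed: {name} {field.type_name} -> {new[name].type_name}"
            )
    for name, field in new.items():
        if name not in old and not field.nullable:
            problems.append(f"new non-nullable field: {name}")
    return problems
```

An empty result means the change passes this particular rule set. It does not cover primary-key changes, which would need key metadata that the sketch does not model.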
Retention Risk
Source logs are retained for a finite period or size. If lag grows past retention, the connector may no longer be able to resume from its stored offset.
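Let retained log capacity be:
$$C_{\text{ret}}$$
and current lag be:
$$B$$
When lag grows at rate:
$$\frac{dB}{dt} > 0$$
time to retention exhaustion is:
$$t_{\text{exhaust}} = \frac{C_{\text{ret}} - B}{dB/dt}$$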
This is a reliability deadline for the connector.
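The deadline can be computed directly from the three quantities above. The function below is a minimal sketch that assumes capacity and lag are both measured in bytes and that the growth rate stays constant; the function and parameter names are hypothetical.

```python
import math


def seconds_to_retention_exhaustion(
    retained_bytes: float,
    lag_bytes: float,
    lag_growth_bytes_per_s: float,
) -> float:
    """Estimate the seconds left before lag exceeds the retained log.

    Returns 0.0 if lag already meets or exceeds retention, and math.inf
    if lag is flat or shrinking.
    """
    if lag_bytes >= retained_bytes:
        return 0.0
    if lag_growth_bytes_per_s <= 0:
        return math.inf
    return (retained_bytes - lag_bytes) / lag_growth_bytes_per_s
```

Alerting on this value rather than on raw lag gives operators a time budget instead of a size they have to interpret.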
Failure Modes
Common failure modes include connector lag, offset loss, duplicate events, missed events, schema incompatibility, poison records, log retention exhaustion, partition reordering, consumer idempotency gaps, replay tools that cannot filter safely and dashboards that show connector up while oldest event age is unacceptable.
CDC should not be used as a hidden dependency. If downstream control, reporting or customer-visible state depends on CDC freshness, lag must be part of the service health model.
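Folding lag into the health model can be as simple as refusing to report healthy when the oldest unprocessed event is too old, even if the connector process is running. The sketch below is one possible shape; the `Health` states, the parameter names and the default warning fraction are assumptions, not recommended values.

```python
from enum import Enum


class Health(Enum):
    OK = "ok"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def cdc_health(
    connector_running: bool,
    oldest_event_age_s: float,
    max_age_s: float,
    warn_fraction: float = 0.5,  # assumed default, tune per use case
) -> Health:
    """Combine process liveness with data freshness into one status."""
    if not connector_running or oldest_event_age_s > max_age_s:
        return Health.UNHEALTHY
    if oldest_event_age_s > warn_fraction * max_age_s:
        return Health.DEGRADED
    return Health.OK
```

With this shape, a dashboard cannot show the pipeline as healthy solely because the connector is up.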
When It Does Not Fit
CDC is a weak fit when downstream commands require immediate confirmation, when source schemas change without coordination, or when the source system cannot retain logs long enough for outages. It is also risky when consumers need business intent that is not visible in row-level changes. In those cases a transactional outbox, explicit domain event or synchronous API boundary may be easier to validate.
Worked Check
Suppose the source log grows at:
and capture throughput is:
The lag growth rate is:
If retained log capacity is:
and current lag is:
then:
The team has less than 90 minutes to restore capture throughput before resume-from-offset becomes unsafe.
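For a concrete sense of scale, here is an illustrative instance of the same calculation. All figures are assumptions chosen for the illustration, not the values of the example above: source log growth of 50 MB/s, capture throughput of 30 MB/s, retained log capacity of 120 GB and current lag of 30 GB, in decimal units. Here $B$ denotes byte lag, $C_{\text{ret}}$ retained capacity and $t_{\text{exhaust}}$ the time to retention exhaustion.

```latex
\begin{aligned}
\frac{dB}{dt} &= 50\ \text{MB/s} - 30\ \text{MB/s} = 20\ \text{MB/s} \\
t_{\text{exhaust}} &= \frac{C_{\text{ret}} - B}{dB/dt}
  = \frac{120\,000\ \text{MB} - 30\,000\ \text{MB}}{20\ \text{MB/s}}
  = 4500\ \text{s} = 75\ \text{min}
\end{aligned}
```

Under these assumed figures the deadline is 75 minutes, and it shortens if the capture throughput drops further.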
Validation Evidence
Useful evidence includes source log position, captured offset, oldest unprocessed event age, capture throughput, publish throughput, duplicate-event tests, replay tests, schema-change tests, retention-margin alerts, consumer idempotency tests and recovery drills after connector restart.
A strong CDC review states what changes are captured, what changes are ignored, how ordering is represented, how replay is performed and how lag maps to data freshness or operational risk.