CDC (Change Data Capture)

Change Data Capture (CDC) is a technique that identifies and captures changes (inserts, updates, deletes) in a database and propagates them to other systems in real time or near real time.

CDC is critical for data integration, real-time analytics, data synchronization, and microservices architectures, ensuring consistency across disparate data sources.

Types of Data Changes

  • Insert: New records added.
  • Update: Existing records modified.
  • Delete: Records removed.

CDC Objectives

  • Real-time Propagation: Minimize latency between change occurrence and delivery.
  • Data Integrity: Capture all changes without loss.
  • Order Preservation: Maintain the sequence of changes (e.g., prevent an earlier update from being applied after a later delete, which would resurrect the removed record).
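
The order-preservation objective can be illustrated with a minimal sketch. It assumes each change event carries a monotonically increasing sequence number; the `ChangeEvent` class, its field names, and the dict-based replica are hypothetical and not taken from any particular CDC tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    seq: int             # assumed position of the change in the source's change stream
    op: str              # "insert", "update", or "delete"
    key: int             # primary key of the affected record
    row: Optional[dict]  # new row image; None for deletes


def apply_event(replica: dict, ev: ChangeEvent) -> None:
    """Apply one change event to a dict-based replica."""
    if ev.op == "delete":
        replica.pop(ev.key, None)
    else:  # inserts and updates are both applied as upserts
        replica[ev.key] = ev.row


def apply_as_received(events) -> dict:
    """Naive consumer: applies events in arrival order."""
    replica = {}
    for ev in events:
        apply_event(replica, ev)
    return replica


def apply_in_order(events) -> dict:
    """Order-preserving consumer: applies events in source sequence order."""
    replica = {}
    for ev in sorted(events, key=lambda e: e.seq):
        apply_event(replica, ev)
    return replica


# The delete (seq 3) arrives before the insert and update it should follow.
events = [
    ChangeEvent(3, "delete", 1, None),
    ChangeEvent(1, "insert", 1, {"name": "a"}),
    ChangeEvent(2, "update", 1, {"name": "b"}),
    ChangeEvent(4, "insert", 2, {"name": "c"}),
]

naive = apply_as_received(events)   # record 1 wrongly survives
ordered = apply_in_order(events)    # record 1 is gone, as in the source
print(naive)
print(ordered)
```

Sorting a finished batch is only for illustration; a streaming consumer would more likely rely on an ordered transport or on per-key sequence checks.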
 

Core Values

Compared with traditional full-data ETL, CDC processes only the changed data, which significantly reduces I/O load on the source database, network bandwidth, and target storage costs. It supports real-time or near-real-time synchronization while retaining the history of data changes.

Application Scenarios

  • Data warehouse/data mart construction: incremental data synchronization that maintains data consistency across services.
  • Real-time updates of caches and search engines (e.g., Redis, Elasticsearch).
  • Database disaster recovery and migration to ensure data continuity.
  • Real-time analytics and event-driven applications (e.g., risk control, recommendation systems).

Implementation Modes

Log-based
  • Core principle: Parse the database transaction log (e.g., MySQL binlog, PostgreSQL WAL, Oracle redo log).
  • Advantages: Low latency (typically millisecond level), low intrusiveness, supports full + incremental data synchronization, and captures committed changes without loss.
  • Disadvantages: Requires database logging to be enabled, depends on database-specific log formats, and some databases require specific permissions.
  • Applicable scenarios: High-concurrency real-time integration systems, core business systems.

Query-based
  • Core principle: Periodically poll an incremental field (e.g., update_time > last_sync).
  • Advantages: Simple to implement, needs only ordinary read access to the source tables, and requires no log access or special permissions.
  • Disadvantages: High latency (minute/hour level), risk of missed changes (e.g., hard deletes and intermediate updates between polls), and repeated scans that add load on the source.
  • Applicable scenarios: Non-real-time scenarios, small-scale systems, low-cost integration solutions.

Trigger-based
  • Core principle: Deploy database triggers that capture data changes at the transaction level.
  • Advantages: Direct capture of change events, high accuracy, compatible with most databases.
  • Disadvantages: High intrusiveness (consumes database resources), may affect transaction performance, and trigger logic is complex to maintain.
  • Applicable scenarios: Legacy system transformation with limited log access, small-scale data synchronization.
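
The difference between the query-based and trigger-based modes can be sketched with Python's built-in sqlite3 module. This is an illustration under assumptions, not a production design: the `users` table, the `users_changes` change table, the integer `update_time` column, and the `poll_changes` helper are all hypothetical. Log-based capture is not shown because it depends on database-specific log readers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    update_time INTEGER NOT NULL   -- hypothetical logical timestamp
);

-- Trigger-based mode: every change is appended to a change table.
CREATE TABLE users_changes (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,  -- preserves change order
    op TEXT NOT NULL,
    id INTEGER NOT NULL,
    name TEXT
);
CREATE TRIGGER users_ai AFTER INSERT ON users BEGIN
    INSERT INTO users_changes (op, id, name) VALUES ('insert', NEW.id, NEW.name);
END;
CREATE TRIGGER users_au AFTER UPDATE ON users BEGIN
    INSERT INTO users_changes (op, id, name) VALUES ('update', NEW.id, NEW.name);
END;
CREATE TRIGGER users_ad AFTER DELETE ON users BEGIN
    INSERT INTO users_changes (op, id, name) VALUES ('delete', OLD.id, NULL);
END;
""")


def poll_changes(conn, last_sync):
    """Query-based mode: fetch rows whose update_time is past the watermark."""
    rows = conn.execute(
        "SELECT id, name, update_time FROM users "
        "WHERE update_time > ? ORDER BY update_time, id",
        (last_sync,),
    ).fetchall()
    new_watermark = max((r[2] for r in rows), default=last_sync)
    return rows, new_watermark


# Source activity that happens between two polls.
conn.execute("INSERT INTO users VALUES (1, 'alice', 1)")
conn.execute("INSERT INTO users VALUES (2, 'bob', 2)")
conn.execute("UPDATE users SET name = 'alicia', update_time = 3 WHERE id = 1")
conn.execute("DELETE FROM users WHERE id = 2")
conn.commit()

# Polling sees only the final state of surviving rows.
polled, watermark = poll_changes(conn, last_sync=0)

# The trigger-populated change table holds every change, in order.
captured = conn.execute(
    "SELECT op, id, name FROM users_changes ORDER BY seq"
).fetchall()

print(polled)
print(captured)
```

In this run the poll returns only the current row for id 1, so the intermediate value and the deletion of id 2 are invisible to it, while the change table records all four operations. The cost is three extra triggers that run inside every writing transaction on the source table.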

see also: