Dive deep into snapshots, incremental streams, and checkpoint mechanisms to master reliable data pipelines.Dive deep into snapshots, incremental streams, and checkpoint mechanisms to master reliable data pipelines.

SeaTunnel CDC Under the Hood: Snapshots, Backfills, and Why Your Checkpoints Time Out

Recently, while using SeaTunnel CDC to synchronize real-time data from Oracle, MySQL, and SQL Server to other relational databases, I spent time reading and modifying the source code of SeaTunnel and Debezium. Through this process, I gained an initial understanding of how the SeaTunnel CDC Source is implemented.

While everything was still fresh, I decided to organize some of the questions I encountered and address common confusions. I will try to explain things in a more approachable way. These are purely my personal understandings—if anything is incorrect, I welcome corrections.

The main topics covered in this article are:

1. The Stages of CDC: Snapshot, Backfill, and Incremental

The overall CDC data reading process can be divided into three phases:

Snapshot (full) → Backfill → Incremental

Snapshot Phase

As the name implies, the snapshot phase captures a snapshot of the current state of the database and reads all existing data. In SeaTunnel’s current implementation, this is done through pure JDBC reads.

During snapshot reading, SeaTunnel records the current binlog position. For MySQL, it executes:

SHOW MASTER STATUS;

which returns results such as:

File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set -------------+-----------+--------------+------------------+------------------ binlog.000011|1001373553 | | |

These values are stored as the low watermark.

Note that this operation is not performed only once.

To improve performance, SeaTunnel has designed its own split mechanism. You can refer to my other article for details. Assume the global parallelism is 10. SeaTunnel initializes 10 channels to execute tasks in parallel.

SeaTunnel first analyzes the number of tables, then splits each table based on the minimum and maximum values of the primary key. The default split size is 8096 rows.

For tables with large data volumes, this can result in more than 100 splits, which are randomly distributed across the 10 channels. At this stage, no data is actually read; SeaTunnel only prepares the SQL queries with WHERE conditions and stores them.

After all tables are split, each split is executed in parallel.

When each split (for example, \n ​SELECT * FROM user_orders WHERE order_id >= 1 AND order_id < 10001) begins execution:

  • SeaTunnel records the current binlog position as the low watermark for that split.
  • After the split finishes reading data, SeaTunnel executes SHOW MASTER STATUS again.
  • The returned position is recorded as the high watermark for that split.

Once one split finishes, the next split begins execution.

The corresponding code is shown below:

// MySqlSnapshotSplitReadTask.doExecute() protected SnapshotResult doExecute(...) { // ① Record low watermark BinlogOffset lowWatermark = currentBinlogOffset(jdbcConnection); dispatcher.dispatchWatermarkEvent(..., lowWatermark, WatermarkKind.LOW); // ② Read snapshot data createDataEvents(ctx, snapshotSplit.getTableId()); // ③ Record high watermark BinlogOffset highWatermark = currentBinlogOffset(jdbcConnection); dispatcher.dispatchWatermarkEvent(..., highWatermark, WatermarkKind.HIGH); }

Notes:

It is recommended to configure a larger split size, such as 100,000 rows. Practical experience shows that more splits do not necessarily lead to better performance.

Backfill Phase

The backfill phase has two modes, controlled by the exactly_once parameter.

  • exactly_once = false (default)

If exactly_once is disabled, SeaTunnel waits until all snapshot splits are completed. It then compares the watermarks of all splits and selects the minimum watermark.

From that point onward, SeaTunnel switches from JDBC reads to CDC log consumption:

  • MySQL: binlog
  • Oracle: redo log
  • SQL Server: CDC log

Log entries are parsed, and corresponding INSERT, UPDATE, or DELETE events are generated.

Each emitted record carries its own position or SCN offset. For each incoming record, SeaTunnel compares its offset with the high watermark. Once the offset exceeds the high watermark, the system transitions into the pure incremental phase.

  • exactly_once = true

When exactly_once is enabled:

  • Snapshot data for each split is not written immediately but cached in memory.
  • SeaTunnel starts a bounded log reading task:
  • Start offset: the split’s low watermark
  • End offset: the split’s high watermark

All change events within this range are parsed and cached.

Snapshot data and log data are then merged in memory. Records are compared by primary key. For example, if a row is inserted during snapshot and later updated during log reading, only the updated version is retained.

This guarantees exactly-once semantics, but it is very memory-intensive.

Incremental Phase

The incremental phase consists of pure log consumption.

If exactly_once is enabled, SeaTunnel starts a new unbounded stream beginning from the high watermark. If it is disabled, incremental reading continues directly from the backfill phase.

Conceptually, backfill reads from the low watermark, while incremental reading starts from the high watermark. The only difference lies in the starting offset.

Summary

1. Two Execution Modes

With exactly_once enabled (Exactly-Once)

  • Snapshot: reads full historical data at the low watermark
  • Backfill: fills the changes between the low and high watermarks
  • Incremental: consumes the unbounded stream after the high watermark

Costs:

  • More state to maintain
  • High memory pressure, especially with many tables and splits

Note:

The machanism of LogMiner

LogMiner is an internal Oracle process running ​inside the production database instance.

  • Each LogMiner session requires:

  • ~1 CPU core for parsing Redo Logs

  • ~500MB–1GB memory for caching and parsing

  • Continuous reading of Redo Log files (I/O operations)

Without enabling exactly_once (semantic: At-Least-Once)

  • Snapshot: still reading historical data

  • Incremental​: directly consume binlog from the low watermark (no separate backfill)

  • Differences:

  • No separate "backfill" step

  • Snapshot and incremental are in ​the same stream​, but ​not mixed:

    • Complete all snapshot splits first
    • Then switch to incremental consumption (from low watermark)

    Backfill and incremental are linked in ​the same stream, which can be considered as "backfill + incremental as one".

2. Will duplicates occur if exactly_once is not enabled?

Yes, under certain conditions:

  • Source side uses ​table/block split parallel SELECT:
  • For whole-database or multi-table scenarios with low parallelism → many SELECT blocks ​will be queued and delayed.
  • During queuing, if new data is inserted:
  • Some blocks’ snapshot SELECT may already see the new data
  • Subsequent incremental binlog will also read it again
  • Result: the same row is written twice to the Sink(typical At-Least-Once behavior)

3. How to minimize duplicate writes without exactly_once?

  • Solution: enable upsert​​ on the Sink (primary key idempotent)
  • JDBC Sink with enable_upsert​ → uses statements like MERGE INTO​ / REPLACE INTO
  • Use upsert for both snapshot and incremental phases:
    • Repeated primary keys overwrite previous values, final table contains only one row per key
  • Semantically: transport is At-Least-Once, downstream result approximates logical Exactly-Once
  • Cost:
  • Snapshot phase also uses upsert → slower than plain INSERT
  • If forcing exactly_once + in-memory filtering:
    • Many split blocks require tracking large amounts of offsets/primary keys in memory → high memory pressure
    • For Oracle (LogMiner-based source):
    • Each block starts an independent LogMiner/streaming session → high intrusion on production DB, increased latency

4. Practical Recommendations

  • Maximize performance, accept "at-least-once" + idempotent:
  • Disable exactly_once​, ​enable Sink upsert​, deduplicate by primary key
  • Few tables, manageable data, strict Exactly-Once requirement:
  • Consider enabling exactly_once​, but evaluate memory and source DB pressure(especially Oracle LogMiner scenarios)

2. How is CDC startup.modetimestamp implemented internally?

The timestamp mode specifies a point in time to sync data. Each database’s CDC mechanism differs, so the way to specify the timestamp is also different.

1. MySQL – Binary Search on Binlog Files

MySQL principle:

  • User specifies millisecond timestamp (e.g., 1734494400000)
  • SeaTunnel executes SHOW BINARY LOGS to get all binlog files
  • Performs binary search on binlog files, reading the timestamp of the first record in each file
  • Finds the first binlog file where timestamp >= specified time
  • Returns the filename and position 0, reads binlog from that position

2. Oracle – TIMESTAMPTOSCN Function

Oracle principle:

  • User specifies millisecond timestamp (e.g., 1763058616003)
  • SeaTunnel converts it to java.sql.Timestamp​, formatted as YYYY-MM-DD HH24:MI:SS.FF3
  • Executes SQL: SELECT TIMESTAMP_TO_SCN(TO_TIMESTAMP('2024-12-18 09:00:00.003', 'YYYY-MM-DD HH24:MI:SS.FF3')) FROM DUAL
  • Oracle built-in function TIMESTAMP_TO_SCN returns corresponding SCN (System Change Number)
  • Returns RedoLogOffset containing that SCN, reads redo log from that SCN
  • SCN can also be converted back to a timestamp: SELECT current_scn FROM v$database;​ and SELECT SCN_TO_TIMESTAMP('240158979') FROM DUAL;

Additionally, since Oracle reads redo logs directly, troubleshooting is difficult. The following SQL simulates Debezium starting a LogMiner session, useful for problem diagnosis:

-- Clean previous LogMiner session BEGIN DBMS_LOGMNR.END_LOGMNR; EXCEPTION WHEN OTHERS THEN NULL; END; SELECT * FROM V$LOGFILE ; -- Add current online log files DECLARE v_first BOOLEAN := TRUE; BEGIN FOR rec IN (SELECT MEMBER FROM V$LOGFILE WHERE TYPE='ONLINE' AND ROWNUM <= 3) LOOP IF v_first THEN DBMS_LOGMNR.ADD_LOGFILE(rec.MEMBER, DBMS_LOGMNR.NEW); DBMS_OUTPUT.PUT_LINE('Added log file: ' || rec.MEMBER); v_first := FALSE; ELSE DBMS_LOGMNR.ADD_LOGFILE(rec.MEMBER, DBMS_LOGMNR.ADDFILE); DBMS_OUTPUT.PUT_LINE('Added log file: ' || rec.MEMBER); END IF; END LOOP; END; -- Start LogMiner session BEGIN DBMS_LOGMNR.START_LOGMNR( OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG + DBMS_LOGMNR.COMMITTED_DATA_ONLY ); DBMS_OUTPUT.PUT_LINE('LogMiner started successfully'); END; -- Query parsed content SELECT SCN, OPERATION, OPERATION_CODE, TABLE_NAME, TO_CHAR(TIMESTAMP, 'YYYY-MM-DD HH24:MI:SS') AS TIMESTAMP, CSF, INFO, SUBSTR(SQL_REDO, 1, 200) AS SQL_REDO_PREVIEW FROM V$LOGMNR_CONTENTS WHERE TABLE_NAME = 'XML_DEBUG_TEST' AND SEG_OWNER = USER ORDER BY SCN, SEQUENCE#; -- Clean LogMiner session BEGIN DBMS_LOGMNR.END_LOGMNR; EXCEPTION WHEN OTHERS THEN NULL; END;

3. PostgreSQL – Timestamp Not Supported

PostgreSQL principle:

  • Does not support timestamp mode
  • Uses LSN (Log Sequence Number) as offset
  • LSN is a 64-bit number representing WAL (Write-Ahead Log) position
  • No direct function to convert a timestamp to LSN
  • Users can only use INITIAL​, EARLIEST​, LATESTmodes

4. SQL Server –​sys.fn_cdc_map_time_to_lsn​​ Function

SQL Server principle:

  • User specifies millisecond timestamp (e.g., 1734494400000)

  • SeaTunnel converts to java.sql.Timestamp

  • Executes SQL: SELECT sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', ?) AS lsn

  • SQL Server built-in function returns the smallest LSN >= specified time

  • Returns LsnOffset containing byte array for that LSN, reads CDC log from that LSN

3. How SeaTunnel Checkpoint Mechanism Interacts with CDC Tasks

Checkpoint enables resume-from-failure. How does it work, and what should be noted?

First, understand the CK implementation: SeaTunnel triggers a checkpoint asynchronously at intervals: SourceFlowLifeCycle.triggerBarrier()

// SourceFlowLifeCycle.triggerBarrier() public void triggerBarrier(Barrier barrier) throws Exception { log.debug("source trigger barrier [{}]", barrier); // Key: acquire checkpoint lock to ensure state consistency synchronized (collector.getCheckpointLock()) { // Step 1: check if prepare to close if (barrier.prepareClose(this.currentTaskLocation)) { this.prepareClose = true; } // Step 2: snapshot state if (barrier.snapshot()) { List<byte[]> states = serializeStates(splitSerializer, reader.snapshotState(barrier.getId())); runningTask.addState(barrier, ActionStateKey.of(sourceAction), states); } // Step 3: acknowledge barrier runningTask.ack(barrier); // Step 4: Key! send barrier as Record downstream collector.sendRecordToNext(new Record<>(barrier)); } }

The checkpoint simulates a barrier record, a special marker that passes through iterators along the data flow: source → transform → sink. At each stage, the Barrier is evaluated:

  • Source stops reading and stores the state in CK

  • Transform passes Barrier without processing

  • Sink flushes buffered batch data upon receiving Barrier

States Saved in Different Phases

Snapshot Phase

Saved content:

public class SnapshotSplit { private final Object[] splitStart; // [1000] private final Object[] splitEnd; // [2000] private final Offset lowWatermark; // binlog.000011:1234 private final Offset highWatermark; // binlog.000011:5678 }

Restore logic:

Key code:

// IncrementalSourceReader.addSplits() for (SourceSplitBase split : splits) { if (split.isSnapshotSplit()) { SnapshotSplit snapshotSplit = split.asSnapshotSplit(); if (snapshotSplit.isSnapshotReadFinished()) { finishedUnackedSplits.put(splitId, snapshotSplit); // completed, skip } else { unfinishedSplits.add(split); // not completed, read again } } }

Incremental Phase

Saved content:

public class IncrementalSplit { private final Offset startupOffset; // current Binlog position private final List<CompletedSnapshotSplitInfo> completedSnapshotSplitInfos; // backfill state private final Map<TableId, byte[]> historyTableChanges; // Debezium history }

Restore logic:

// IncrementalSourceReader.initializedState() if (split.isIncrementalSplit()) { IncrementalSplit incrementalSplit = split.asIncrementalSplit(); // restore table schema debeziumDeserializationSchema.restoreCheckpointProducedType( incrementalSplit.getCheckpointTables() ); // continue consuming from startupOffset return new IncrementalSplitState(incrementalSplit); }

Checkpoint State Comparison

| Phase | Saved Content | Restore Method | Duplicate Risk | |----|----|----|----| | Snapshot | Split range + Watermarks | Re-execute unfinished Splits | Yes (Sink must be idempotent; repeated Select may query different snapshot points) | | Incremental | Binlog Offset + Table Schema | Continue from Offset | No |

Thus, it is recommended ​to avoid restoring or pausing tasks during full snapshot and backfill phases, as many unknown issues may arise.

4. Checkpoint Timeout

In practice, checkpoint (CK) timeout may occur even after 10–20 minutes. Why?

Analyzing CK and CDC tasks: long CK timeouts are usually due to ​insufficient write performance or misconfiguration on the Sink. Source only triggers CK to save minimal metadata quickly; the Sink must process all pending writes before the CK Barrier passes.

Checkpoint Timeout Mechanism Analysis

Checkpoint ensures Exactly-Once semantics; CK Barrier must propagate from source to sink, with all operators saving state.

  1. Source speed: only records read position and metadata (Split, Offset), usually milliseconds → fast CK trigger
  2. Sink blocking: must complete all pre-Barrier writes (e.g., 10,000 records)
  3. Timeout occurs: if Sink is slow → Barrier cannot pass within timeout (e.g., 10–20 min) → CK Timeout

Conclusion: prolonged timeout almost always means Sink cannot handle the backlog in time.

Solutions

1. Optimize Sink (most common)

MySQL

# JDBC URL add batch rewrite parameter jdbc:mysql://host:port/db?rewriteBatchedStatements=true&cachePrepStmts=true

Doris/StarRocks

# Use stream load mode with tuning parameters sink { Doris { sink.enable-2pc = true sink.buffer-size = 1048576 sink.buffer-count = 3 } }

PostgreSQL

sink { Jdbc { # Use COPY mode instead of INSERT use_copy_statement = true } }

2. Source-side throttling

env { job.mode = STREAMING # Limit read speed to give Sink time read_limit.rows_per_second = 4000 read_limit.bytes_per_second = 7000000 # Increase checkpoint timeout checkpoint.interval = 30000 checkpoint.timeout = 600000 }

Closing Remarks

CDC technology is indeed complex, involving many aspects of distributed systems: parallelism control, state management, fault tolerance, exactly-once semantics, and a deep understanding of databases. SeaTunnel, building on Debezium, has implemented numerous engineering optimizations, fixed multiple bugs, and its architecture is friendly for newcomers. Whether fixing documentation or directly debugging code, it is relatively easy to get started.

We warmly welcome contributors to join the community!

This article aims to help you better understand SeaTunnel CDC’s internal mechanisms, reduce pitfalls in production, and improve tuning. Any questions or discovered errors are welcome for discussion and correction.

Finally, wishing all CDC tasks run stably without interruption, and checkpoints never time out again!

\

Market Opportunity
Robinhood Logo
Robinhood Price(HOOD)
$0.00000801
$0.00000801$0.00000801
-2.11%
USD
Robinhood (HOOD) Live Price Chart
Disclaimer: The articles reposted on this site are sourced from public platforms and are provided for informational purposes only. They do not necessarily reflect the views of MEXC. All rights remain with the original authors. If you believe any content infringes on third-party rights, please contact [email protected] for removal. MEXC makes no guarantees regarding the accuracy, completeness, or timeliness of the content and is not responsible for any actions taken based on the information provided. The content does not constitute financial, legal, or other professional advice, nor should it be considered a recommendation or endorsement by MEXC.

You May Also Like

Crucial Fed Rate Cut: Powell’s Bold Risk Management Move Explained

Crucial Fed Rate Cut: Powell’s Bold Risk Management Move Explained

BitcoinWorld Crucial Fed Rate Cut: Powell’s Bold Risk Management Move Explained In a significant development for global financial markets, Federal Reserve Chair Jerome Powell recently described the latest Fed rate cut as a critical risk management measure. This statement immediately captured the attention of investors, economists, and especially those in the dynamic cryptocurrency space. Understanding Powell’s rationale and the potential implications of this move is essential for navigating today’s complex economic landscape. What Exactly is a Fed Rate Cut and Why Does it Matter? A Fed rate cut refers to the Federal Reserve lowering the target range for the federal funds rate. This is the interest rate at which commercial banks borrow and lend their excess reserves to each other overnight. When the Fed lowers this rate, it typically makes borrowing cheaper across the entire economy. This decision impacts everything from mortgage rates to business loans. The Fed uses interest rates as a primary tool to influence economic activity, aiming to achieve maximum employment and stable prices. A lower rate often stimulates spending and investment, but it can also signal concerns about economic slowdown. Key reasons for a rate cut often include: Slowing economic growth or recession fears. Low inflation or deflationary pressures. Global economic instability impacting domestic markets. A desire to provide more liquidity to the financial system. Powell’s emphasis on ‘risk management’ suggests a proactive approach. The Fed is not just reacting to current data but also anticipating potential future challenges. They are essentially trying to prevent a worse economic outcome by adjusting policy now. How Does a Fed Rate Cut Influence the Broader Economy? When the Federal Reserve implements a Fed rate cut, it sends ripples throughout the financial world. For traditional markets, lower interest rates generally mean: Boost for Stocks: Companies can borrow more cheaply, potentially increasing profits and stock valuations. Investors might also move money from lower-yielding bonds into equities. Cheaper Borrowing: Consumers and businesses enjoy lower rates on loans, from mortgages to credit cards, encouraging spending and investment. Weaker Dollar: Lower rates can make a country’s currency less attractive to foreign investors, potentially leading to a weaker dollar. Bond Market Shifts: Existing bonds with higher yields become more attractive, while newly issued bonds will have lower yields. This shift in monetary policy aims to inject confidence and liquidity into the system, countering potential economic headwinds. However, there’s always a delicate balance to strike, as too much stimulus can lead to inflationary pressures down the line. What Does This Fed Rate Cut Mean for Cryptocurrency Investors? The impact of a Fed rate cut on the cryptocurrency market is often a topic of intense discussion. While crypto assets operate independently of central banks, they are not immune to broader macroeconomic forces. Here’s how a rate cut can play out: Increased Risk Appetite: With traditional savings and bond yields potentially lower, investors might seek higher returns in riskier assets, including cryptocurrencies like Bitcoin and Ethereum. Inflation Hedge Narrative: Some view cryptocurrencies, particularly Bitcoin, as a hedge against inflation and traditional currency debasement. If a rate cut leads to concerns about inflation, this narrative could gain traction. Liquidity Influx: A more accommodative monetary policy can increase overall liquidity in the financial system, some of which may flow into digital assets. Dollar Weakness: A weaker dollar, a potential consequence of rate cuts, can sometimes make dollar-denominated assets like crypto more appealing to international investors. However, it’s crucial to remember that the crypto market also has its unique drivers, including technological developments, regulatory news, and market sentiment. While a Fed rate cut can provide a tailwind, it’s not the sole determinant of crypto performance. Navigating the New Landscape: Actionable Insights for Crypto Investors Given the Federal Reserve’s stance on risk management through a Fed rate cut, what steps can crypto investors consider? Stay Informed: Keep a close watch on further Fed announcements and economic data. Understanding the broader macroeconomic picture is vital. Diversify Your Portfolio: While a rate cut might favor risk assets, a balanced portfolio that includes a mix of traditional and digital assets can help mitigate volatility. Long-Term Perspective: Focus on the fundamental value and long-term potential of your chosen cryptocurrencies rather than short-term fluctuations driven by macro news. Assess Risk Tolerance: Re-evaluate your personal risk tolerance in light of potential market shifts. Lower rates can encourage speculation, but prudence remains key. Powell’s description of the Fed rate cut as a risk management measure highlights the central bank’s commitment to maintaining economic stability. For cryptocurrency enthusiasts, this move underscores the increasing interconnectedness of traditional finance and the digital asset world. While a rate cut can create opportunities, a thoughtful and informed approach is always the best strategy. Frequently Asked Questions (FAQs) What exactly is a Fed rate cut? A Fed rate cut is when the Federal Reserve lowers its target for the federal funds rate, which is the benchmark interest rate banks charge each other for overnight lending. This action makes borrowing cheaper across the economy, aiming to stimulate economic activity. Why did Powell emphasize “risk management” for this Fed rate cut? Jerome Powell emphasized “risk management” to indicate that the Fed was proactively addressing potential economic slowdowns or other future challenges. It suggests a preventative measure to safeguard against adverse economic conditions rather than merely reacting to existing problems. How does a Fed rate cut typically affect the crypto market? A Fed rate cut can make traditional investments less attractive due to lower yields, potentially driving investors towards higher-risk, higher-reward assets like cryptocurrencies. It can also increase overall market liquidity and strengthen the narrative of crypto as an inflation hedge. Should crypto investors change their strategy after a rate cut? While a rate cut can influence market dynamics, crypto investors should primarily focus on their long-term strategy, fundamental research, and risk tolerance. It’s wise to stay informed about macroeconomic trends but avoid making impulsive decisions based solely on a single policy change. What are the potential downsides of a Fed rate cut? Potential downsides include increased inflationary pressures if the economy overheats, a weaker national currency, and the possibility of creating asset bubbles as investors chase higher returns in riskier markets. It can also signal underlying concerns about economic health. Did you find this article insightful? Share your thoughts and help others understand the implications of the Fed’s latest move! Follow us on social media for more real-time updates and expert analysis. To learn more about the latest crypto market trends, explore our article on key developments shaping Bitcoin’s price action. This post Crucial Fed Rate Cut: Powell’s Bold Risk Management Move Explained first appeared on BitcoinWorld.
Share
Coinstats2025/09/18 16:40
Motive Files Registration Statement for Proposed Initial Public Offering

Motive Files Registration Statement for Proposed Initial Public Offering

SAN FRANCISCO–(BUSINESS WIRE)–Motive Technologies, Inc., the AI platform for physical operations, today announced that it has filed a registration statement on
Share
AI Journal2025/12/24 07:00
New Gold Protocol's NGP token was exploited and attacked, resulting in a loss of approximately $2 million.

New Gold Protocol's NGP token was exploited and attacked, resulting in a loss of approximately $2 million.

PANews reported on September 18th that according to Paidun monitoring, New Gold Protocol's NGP token was exploited in an attack, resulting in a loss of approximately $2 million. The NGP token plummeted 88% in an hour, and the attacker deposited the stolen funds (443.8 ETH) into TornadoCash.
Share
PANews2025/09/18 11:10