Skip to content

feat binlog apply optimization#1378

Open
shaohk wants to merge 5 commits into
github:masterfrom
shaohk:feat-binlog-apply-optimization
Open

feat binlog apply optimization#1378
shaohk wants to merge 5 commits into
github:masterfrom
shaohk:feat-binlog-apply-optimization

Conversation

@shaohk
Copy link
Copy Markdown
Contributor

@shaohk shaohk commented Feb 5, 2024

Description

  1. Support for ignoring binlog events that exceed chunk boundary values.
    If the value corresponding to the unique key of the binlog exceeds the maximum value of chunk iteration and is less than the maximum boundary value of the copy, it is ignored.

  2. Support for binlog merge processing.
    When the columns of the unique key selected by chunk are int or float, the binlog is processed by map merging, and all delete operations are merged into one delete sql.All insert and update operations are merged into one replace sql. Then execute these sql in db transaction.

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

shaohoukun added 2 commits February 5, 2024 18:21
1. Support for ignoring binlog events that exceed chunk boundary values. 2. Support for binlog merge
processing.
@shaohk shaohk changed the title (feat): Feat binlog apply optimization Feat binlog apply optimization Feb 5, 2024
shaohoukun added 3 commits February 6, 2024 14:39
When there is only one column in a unique index, merging of DML binlog events is permitted.
…tionRangeMaxValues is nil

Add handling for the case when MigrationIterationRangeMaxValues is nil
@dnovitski
Copy link
Copy Markdown
Contributor

Correctness Issues Found + Fix PR

I reviewed this PR in detail and found several correctness bugs in the merge DML implementation. I've submitted a fix as PR #1687 (built on top of this branch).

Critical Issues

  1. generateReplaceQuery reads live source stateREPLACE INTO ghost SELECT ... FROM source WHERE pk IN (...) reads the current source row, not the binlog event's row image. This is incorrect when:

    • Events are buffered during throttle (source may have changed since event time)
    • gh-ost is killed and restarted (replay uses current state, not original)
    • Cutover happens before all events drain
  2. isIgnoreOverMaxChunkRangeEvent returns true when nil — Before the first row-copy chunk, MigrationIterationRangeMaxValues is nil, causing the comparison function to return true (ignore event). Early events are silently dropped.

  3. Missing column conversion tokens — If using plain ? placeholders, timezone (convert_tz), enum (ELT), and JSON (convert(? using utf8mb4)) columns would be inserted without their required conversions.

Fix (PR #1687)

  • Uses REPLACE INTO ghost (cols) VALUES (?, ?) with binlog after-image values
  • Handles column conversions via BuildColumnsPreparedValues
  • Fixes nil check to return false (don't ignore)
  • Introduces mergedEntry struct for safe map value handling

Performance Results

The fix maintains the performance benefits:

  • 10% faster general workload, 39% faster hot-spot dedup, 2.25× sustained throughput vs non-merge mode

See #1687 for full details and benchmarks.

dnovitski added a commit to dnovitski/gh-ost that referenced this pull request May 22, 2026
Port the merge-DML batching optimization from PR github#1378 to current
master, adapting it to the refactored builder-pattern API.

When --is-merge-dml-event is enabled and the unique key is
memory-comparable (all numeric columns):

- Deduplicates DML events by unique key (latest event wins)
- Cancels INSERT+DELETE pairs for same key (net no-op)
- Batches remaining INSERTs/UPDATEs as multi-row REPLACE INTO
- Batches remaining DELETEs as DELETE WHERE (pk) IN (...)
- Skips events beyond migration range (already copied by row-copy)

Uses BuildColumnsPreparedValues for proper per-column conversion
tokens (convert_tz, ELT, etc.) preventing data corruption for
timezone, enum, and JSON columns.

Co-authored-by: shaohoukun <shaohoukun@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dnovitski added a commit to dnovitski/gh-ost that referenced this pull request May 22, 2026
Port the merge-DML batching optimization from PR github#1378 to current
master, adapting it to the refactored builder-pattern API.

When --is-merge-dml-event is enabled and the unique key is
memory-comparable (all numeric columns):

- Deduplicates DML events by unique key (latest event wins)
- Cancels INSERT+DELETE pairs for same key (net no-op)
- Batches remaining INSERTs/UPDATEs as multi-row REPLACE INTO
- Batches remaining DELETEs as DELETE WHERE (pk) IN (...)
- Skips events beyond migration range (already copied by row-copy)

Uses BuildColumnsPreparedValues for proper per-column conversion
tokens (convert_tz, ELT, etc.) preventing data corruption for
timezone, enum, and JSON columns.

Co-authored-by: shaohoukun <shaohoukun@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dnovitski added a commit to dnovitski/gh-ost that referenced this pull request May 22, 2026
Port the merge-DML batching optimization from PR github#1378 to current
master, adapting it to the refactored builder-pattern API.

When --is-merge-dml-event is enabled and the unique key is
memory-comparable (all numeric columns):

- Deduplicates DML events by unique key (latest event wins)
- Cancels INSERT+DELETE pairs for same key (net no-op)
- Batches remaining INSERTs/UPDATEs as multi-row REPLACE INTO
- Batches remaining DELETEs as DELETE WHERE (pk) IN (...)
- Skips events beyond migration range (already copied by row-copy)

Uses BuildColumnsPreparedValues for proper per-column conversion
tokens (convert_tz, ELT, etc.) preventing data corruption for
timezone, enum, and JSON columns.

Co-authored-by: shaohoukun <shaohoukun@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dnovitski added a commit to dnovitski/gh-ost that referenced this pull request May 22, 2026
Port the merge-DML batching optimization from PR github#1378 to current
master, adapting it to the refactored builder-pattern API.

When --is-merge-dml-event is enabled and the unique key is
memory-comparable (all numeric columns):

- Deduplicates DML events by unique key (latest event wins)
- Cancels INSERT+DELETE pairs for same key (net no-op)
- Batches remaining INSERTs/UPDATEs as multi-row REPLACE INTO
- Batches remaining DELETEs as DELETE WHERE (pk) IN (...)
- Skips events beyond migration range (already copied by row-copy)

Uses BuildColumnsPreparedValues for proper per-column conversion
tokens (convert_tz, ELT, etc.) preventing data corruption for
timezone, enum, and JSON columns.

Co-authored-by: shaohoukun <shaohoukun@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dnovitski added a commit to dnovitski/gh-ost that referenced this pull request May 22, 2026
Add --is-merge-dml-event flag that batches and deduplicates binlog DML
events before applying them to the ghost table, significantly reducing
SQL round-trips during high-write migrations.

When enabled and the unique key is memory-comparable (numeric columns):
- Deduplicates DML events by unique key (latest event wins)
- Reduces INSERT+DELETE sequences to DELETE (safe against row-copy races)
- Batches INSERTs/UPDATEs as multi-row REPLACE INTO
- Batches DELETEs as DELETE WHERE (pk) IN (...)
- Skips events beyond migration range (not yet copied by row-copy)
- Disables merge for tables with secondary unique indexes

Safety: strict numeric type validation in formatNumericValue prevents
SQL injection. Type detection uses exact base-type parsing (not substring).
Uses BuildColumnsPreparedValues for proper per-column conversion tokens.

Original implementation by shaohoukun in PR github#1378, adapted to current
master's builder-pattern API with correctness and security hardening.

Co-authored-by: shaohoukun <shaohoukun@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dnovitski added a commit to dnovitski/gh-ost that referenced this pull request May 22, 2026
Add --is-merge-dml-event flag that batches and deduplicates binlog DML
events before applying them to the ghost table, significantly reducing
SQL round-trips during high-write migrations.

When enabled and the unique key is memory-comparable (numeric columns):
- Deduplicates DML events by unique key (latest event wins)
- Reduces INSERT+DELETE sequences to DELETE (safe against row-copy races)
- Batches INSERTs/UPDATEs as multi-row REPLACE INTO
- Batches DELETEs as DELETE WHERE (pk) IN (...)
- Skips events beyond migration range (not yet copied by row-copy)
- Disables merge for tables with secondary unique indexes

Safety: strict numeric type validation in formatNumericValue prevents
SQL injection. Type detection uses exact base-type parsing (not substring).
Uses BuildColumnsPreparedValues for proper per-column conversion tokens.

Original implementation by shaohoukun in PR github#1378, adapted to current
master's builder-pattern API with correctness and security hardening.

Co-authored-by: shaohoukun <shaohoukun@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dnovitski added a commit to dnovitski/gh-ost that referenced this pull request May 22, 2026
Add --is-merge-dml-event flag that batches and deduplicates binlog DML
events before applying them to the ghost table, significantly reducing
SQL round-trips during high-write migrations.

When enabled and the unique key is memory-comparable (numeric columns):
- Deduplicates DML events by unique key (latest event wins)
- Reduces INSERT+DELETE sequences to DELETE (safe against row-copy races)
- Batches INSERTs/UPDATEs as multi-row REPLACE INTO
- Batches DELETEs as DELETE WHERE (pk) IN (...)
- Skips events beyond migration range (not yet copied by row-copy)
- Disables merge for tables with secondary unique indexes

Safety: strict numeric type validation in formatNumericValue prevents
SQL injection. Type detection uses exact base-type parsing (not substring).
Uses BuildColumnsPreparedValues for proper per-column conversion tokens.

Original implementation by shaohoukun in PR github#1378, adapted to current
master's builder-pattern API with correctness and security hardening.

Co-authored-by: shaohoukun <shaohoukun@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dnovitski added a commit to dnovitski/gh-ost that referenced this pull request May 22, 2026
Add --is-merge-dml-event flag that batches and deduplicates binlog DML
events before applying them to the ghost table, significantly reducing
SQL round-trips during high-write migrations.

When enabled and the unique key is memory-comparable (numeric columns):
- Deduplicates DML events by unique key (latest event wins)
- Reduces INSERT+DELETE sequences to DELETE (safe against row-copy races)
- Batches INSERTs/UPDATEs as multi-row REPLACE INTO
- Batches DELETEs as DELETE WHERE (pk) IN (...)
- Skips events beyond migration range (not yet copied by row-copy)
- Disables merge for tables with secondary unique indexes

Safety: strict numeric type validation in formatNumericValue prevents
SQL injection. Type detection uses exact base-type parsing (not substring).
Uses BuildColumnsPreparedValues for proper per-column conversion tokens.

Original implementation by shaohoukun in PR github#1378, adapted to current
master's builder-pattern API with correctness and security hardening.

Co-authored-by: shaohoukun <shaohoukun@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dnovitski added a commit to dnovitski/gh-ost that referenced this pull request May 22, 2026
Add --is-merge-dml-event flag that batches and deduplicates binlog DML
events before applying them to the ghost table, significantly reducing
SQL round-trips during high-write migrations.

When enabled and the unique key is memory-comparable (numeric columns):
- Deduplicates DML events by unique key (latest event wins)
- Reduces INSERT+DELETE sequences to DELETE (safe against row-copy races)
- Batches INSERTs/UPDATEs as multi-row REPLACE INTO
- Batches DELETEs as DELETE WHERE (pk) IN (...)
- Skips events beyond migration range (not yet copied by row-copy)
- Disables merge for tables with secondary unique indexes

Safety: strict numeric type validation in formatNumericValue prevents
SQL injection. Type detection uses exact base-type parsing (not substring).
Uses BuildColumnsPreparedValues for proper per-column conversion tokens.

Original implementation by shaohoukun in PR github#1378, adapted to current
master's builder-pattern API with correctness and security hardening.

Co-authored-by: shaohoukun <shaohoukun@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dnovitski added a commit to dnovitski/gh-ost that referenced this pull request May 22, 2026
Add --is-merge-dml-event flag that batches and deduplicates binlog DML
events before applying them to the ghost table, significantly reducing
SQL round-trips during high-write migrations.

When enabled and the unique key is memory-comparable (numeric columns):
- Deduplicates DML events by unique key (latest event wins)
- Reduces INSERT+DELETE sequences to DELETE (safe against row-copy races)
- Batches INSERTs/UPDATEs as multi-row REPLACE INTO
- Batches DELETEs as DELETE WHERE (pk) IN (...)
- Skips events beyond migration range (not yet copied by row-copy)
- Disables merge for tables with secondary unique indexes

Safety: strict numeric type validation in formatNumericValue prevents
SQL injection. Type detection uses exact base-type parsing (not substring).
Uses BuildColumnsPreparedValues for proper per-column conversion tokens.

Original implementation by shaohoukun in PR github#1378, adapted to current
master's builder-pattern API with correctness and security hardening.

Co-authored-by: shaohoukun <shaohoukun@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dnovitski added a commit to dnovitski/gh-ost that referenced this pull request May 22, 2026
Add --is-merge-dml-event flag that batches and deduplicates binlog DML
events before applying them to the ghost table, significantly reducing
SQL round-trips during high-write migrations.

When enabled and the unique key is memory-comparable (numeric columns):
- Deduplicates DML events by unique key (latest event wins)
- Reduces INSERT+DELETE sequences to DELETE (safe against row-copy races)
- Batches INSERTs/UPDATEs as multi-row REPLACE INTO
- Batches DELETEs as DELETE WHERE (pk) IN (...)
- Skips events beyond migration range (not yet copied by row-copy)
- Disables merge for tables with secondary unique indexes

Safety: strict numeric type validation in formatNumericValue prevents
SQL injection. Type detection uses exact base-type parsing (not substring).
Uses BuildColumnsPreparedValues for proper per-column conversion tokens.

Original implementation by shaohoukun in PR github#1378, adapted to current
master's builder-pattern API with correctness and security hardening.

Co-authored-by: shaohk <shaohoukun@meituan.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dnovitski added a commit to dnovitski/gh-ost that referenced this pull request May 22, 2026
Add --is-merge-dml-event flag that batches and deduplicates binlog DML
events before applying them to the ghost table, significantly reducing
SQL round-trips during high-write migrations.

When enabled and the unique key is memory-comparable (numeric columns):
- Deduplicates DML events by unique key (latest event wins)
- Reduces INSERT+DELETE sequences to DELETE (safe against row-copy races)
- Batches INSERTs/UPDATEs as multi-row REPLACE INTO
- Batches DELETEs as DELETE WHERE (pk) IN (...)
- Skips events beyond migration range (not yet copied by row-copy)
- Disables merge for tables with secondary unique indexes

Safety: strict numeric type validation in formatNumericValue prevents
SQL injection. Type detection uses exact base-type parsing (not substring).
Uses BuildColumnsPreparedValues for proper per-column conversion tokens.

Original implementation by shaohoukun in PR github#1378, adapted to current
master's builder-pattern API with correctness and security hardening.

Co-authored-by: shaohk <shaohoukun@meituan.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants