
Conversation

@sidkhillon

Description

When replication tails a WAL with compression enabled, hitting EOF mid-entry currently triggers EOF_AND_RESET_COMPRESSION. This clears all compression dictionaries, reopens the file from the beginning, and re-reads every entry to rebuild dictionary state — an O(n) operation that grows more expensive the further into the file we are.

This change defers dictionary updates in CompressedKvDecoder until an entire cell is successfully parsed. If EOF interrupts mid-cell, the pending additions are discarded and all dictionaries remain in the exact state they were after the last fully-read cell. This allows ProtobufWALTailingReader to return EOF_AND_RESET instead of EOF_AND_RESET_COMPRESSION, replacing the O(n) replay with an O(1) seek to the last cell boundary.

How it works

CompressedKvDecoder.parseCell() now buffers dictionary additions in a pending list, and puts TagCompressionContext into deferred mode. On successful parse, all additions are committed. On failure, they're discarded.
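
Roughly, the flow looks like this (a simplified sketch; the helper and method names are illustrative, not the exact code):

    // Sketch of the deferred-commit flow in CompressedKvDecoder (names illustrative).
    private Cell parseCell(DataInputStream in) throws IOException {
      List<byte[]> pendingAdditions = new ArrayList<>(); // buffered dictionary entries
      tagCompressionContext.beginDeferredMode();
      try {
        Cell cell = readCompressedCell(in, pendingAdditions); // parses without mutating dictionaries
        // The whole cell parsed cleanly: now make the dictionary changes visible.
        commit(pendingAdditions); // addEntry() into the relevant dictionaries
        tagCompressionContext.commitDeferredAdditions();
        return cell;
      } catch (EOFException e) {
        // EOF mid-cell: drop the buffered additions, so every dictionary stays
        // in the state it had after the last fully-read cell.
        tagCompressionContext.discardDeferredAdditions();
        throw e;
      }
    }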

ProtobufWALTailingReader switches from batch cell reading (readFromCells) to cell-by-cell reading, tracking the stream position before each cell. On EOF mid-entry, it saves the partially-read entry, remaining cell count, and resume position. On the next next() call after reset, it resumes reading only the remaining cells.
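
In sketch form (state and field names are illustrative):

    // Cell-by-cell reading with resume bookkeeping (illustrative names).
    private State readCells(WAL.Entry entry, int cellCount) throws IOException {
      for (int i = 0; i < cellCount; i++) {
        long cellStart = inputStream.getPos(); // stream position before this cell
        try {
          entry.getEdit().add(cellDecoder.parseCell());
        } catch (EOFException e) {
          // Save the partial entry, the cells still owed, and the last cell
          // boundary, so the next next() call can resume from there.
          this.partialEntry = entry;
          this.remainingCells = cellCount - i;
          this.resumePosition = cellStart;
          return State.EOF_AND_RESET; // an O(1) seek instead of an O(n) replay
        }
      }
      return State.NORMAL;
    }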

Dictionary interface gains a size() method so TagCompressionContext can resolve within-cell dictionary references against deferred entries (when a later tag in the same cell was encoded as a dictionary hit referencing an earlier tag that hasn't been committed yet). This is defensive -- it is currently not possible for a Put to carry two tags with the same type byte on the same cell, but IMO it's better to be safe here.
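
The interface addition is small (sketch; existing methods elided):

    public interface Dictionary {
      byte[] getEntry(short idx);
      short addEntry(byte[] data, int offset, int length);
      int size(); // new: committed entry count, so TagCompressionContext can map
                  // an index past the committed range onto its deferred list
      // ... remaining existing methods unchanged
    }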

Tests

  • TestCompressedKvDecoderDeferredDictUpdates — sweeps every byte truncation point across encoded cells, verifying dictionary state matches a clean reference (see the sketch after this list). Covers plain, tag, and value compression variants, plus a cell-boundary resume test.
  • TestWALTailingReaderPartialCellResume — integration tests with real WAL files verifying EOF_AND_RESET state, single resume, and multiple consecutive partial reads.
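
The truncation sweep in the first test is roughly this shape (hedged sketch; the helpers are illustrative):

    // Decoding a stream truncated at any byte offset must leave the dictionaries
    // exactly where a clean decoder of the same complete cells would leave them.
    for (int cut = 0; cut <= encoded.length; cut++) {
      byte[] truncated = Arrays.copyOf(encoded, cut);
      CompressedKvDecoder decoder = newDecoderWithFreshDictionaries();
      try {
        decodeAll(decoder, truncated);
      } catch (EOFException expected) {
        // truncation mid-cell is the interesting case
      }
      assertDictionariesMatch(referenceAfterCompleteCells(encoded, cut), decoder);
    }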

@sidkhillon changed the title from "Reset last read" to "(Not yet upstream) Reset to last successful cell on EOF with WAL Compression" on Feb 10, 2026
@sidkhillon marked this pull request as ready for review on February 10, 2026 21:38
@sidkhillon
Author

This hasn't gone through upstream review yet, so I'd really appreciate any feedback on this change.

I have verified this hasn't caused any data loss by doing a full table comparison after a few hours of running this in QA.

Collaborator

@charlesconnell left a comment


I recommend changing the title+desc to make clear that this affects tag compression only, which uses a different scheme than compression on other parts of the WAL.

I'm a little unconvinced that this is the best approach to solve the problem. Dictionary is just a glorified map. Additions to the map are effectively atomic (you're not adding half a tag value to it, it's all or nothing). What condition makes it contain invalid data?

@sidkhillon
Author

sidkhillon commented Feb 10, 2026

> I recommend changing the title+desc to make clear that this affects tag compression only

This doesn't affect only tag compression; the change covers all WAL compression, including both key and value compression. Those changes are primarily in WALCellCodec.

@sidkhillon
Author

> Dictionary is just a glorified map. Additions to the map are effectively atomic (you're not adding half a tag value to it, it's all or nothing). What condition makes it contain invalid data?

The Dictionary uses an LRU cache to store entries. When we read a partial entry, we've decoded only some of its cells, and those reads have already added entries to the dictionary (possibly evicting the oldest ones). If we simply seek back to the old starting point and re-read, those additions are repeated, leaving the cache "corrupt" relative to the writer's.

@sidkhillon
Author

The relevant PR and JIRA are here; they introduced support for turning on WAL Compression and Replication simultaneously. This PR attempts to speed up that process.

}
try {
  return tagDict.getEntry(dictIdx);
} catch (IndexOutOfBoundsException e) {
Collaborator


ooc -- why is this necessary now?

Author


If we were referencing a tag that appeared earlier in the same cell, we would need to search the deferred additions to find it. This handles that case.

Theoretically, this should be impossible because a cell should carry at most one tag of each type, but this was some defensive programming just in case.
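
In sketch form, the fallback looks like this (shape only, not character-for-character):

    try {
      return tagDict.getEntry(dictIdx);
    } catch (IndexOutOfBoundsException e) {
      // dictIdx may point at a tag added earlier in this same cell, which is
      // still parked in deferredAdditions rather than committed to tagDict.
      int deferredIdx = dictIdx - tagDict.size();
      if (deferAdditions && deferredIdx >= 0 && deferredIdx < deferredAdditions.size()) {
        return deferredAdditions.get(deferredIdx);
      }
      throw new IOException("Invalid tag dictionary index: " + dictIdx, e);
    }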

public class TagCompressionContext {
  private final Dictionary tagDict;
  private boolean deferAdditions = false;
  private List<byte[]> deferredAdditions;
Collaborator


If you initialize this to an empty ArrayList rather than null, I think it simplifies some of the logic below
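
i.e. (sketch):

    // Eager initialization: the null checks in the commit/discard paths go away.
    private List<byte[]> deferredAdditions = new ArrayList<>();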

Author


Good call, fixed.

@hgromer
Collaborator

hgromer commented Feb 11, 2026

> > Dictionary is just a glorified map. Additions to the map are effectively atomic (you're not adding half a tag value to it, it's all or nothing). What condition makes it contain invalid data?
>
> The Dictionary uses an LRU cache to store entries. When we read a partial entry, we've decoded only some of its cells, and those reads have already added entries to the dictionary (possibly evicting the oldest ones). If we simply seek back to the old starting point and re-read, those additions are repeated, leaving the cache "corrupt" relative to the writer's.

I know we had discussed this, but now I'm looking at the code and I'm forgetting why we can't just continue reading from the offset, and continue appending to the LRU.

When we read the values, first we read the length of the tag/value. We should only insert valid entries into the dictionary. Additionally, I'm forgetting why we can't just seek back to the position we were last at. It's not clear to me why the evicted value is important if we just pick up from where we left off.

We definitely discussed this yesterday but my memory is admittedly fuzzy, mind elaborating a bit?

@sidkhillon
Author

> > > Dictionary is just a glorified map. Additions to the map are effectively atomic (you're not adding half a tag value to it, it's all or nothing). What condition makes it contain invalid data?
> >
> > The Dictionary uses an LRU cache to store entries. When we read a partial entry, we've decoded only some of its cells, and those reads have already added entries to the dictionary (possibly evicting the oldest ones). If we simply seek back to the old starting point and re-read, those additions are repeated, leaving the cache "corrupt" relative to the writer's.
>
> I know we had discussed this, but now I'm looking at the code and I'm forgetting why we can't just continue reading from the offset, and continue appending to the LRU.
>
> When we read the values, first we read the length of the tag/value. We should only insert valid entries into the dictionary. Additionally, I'm forgetting why we can't just seek back to the position we were last at. It's not clear to me why the evicted value is important if we just pick up from where we left off.
>
> We definitely discussed this yesterday but my memory is admittedly fuzzy, mind elaborating a bit?

Here's an example:

Before reading a cell, both writer and reader dictionaries are in sync:
Index 0: "row-0"
Index 1: "row-1"
Index 2: "row-2"
currSize = 3

Reader starts reading a cell. ROW = "row-3" is NOT_IN_DICTIONARY on the wire, so the reader reads the raw bytes and calls addEntry("row-3") — added at index 3, currSize = 4. Then EOF hits while reading FAMILY.

Reader seeks back to cell start and retries. The stream still starts with NOT_IN_DICTIONARY + "row-3". The reader calls addEntry("row-3") again — added at index 4, currSize = 5.

Now the reader has:
Index 0: "row-0"
Index 1: "row-1"
Index 2: "row-2"
Index 3: "row-3" ← from failed partial read
Index 4: "row-3" ← from retry (duplicate!)
currSize = 5

But the writer has:
Index 0: "row-0"
Index 1: "row-1"
Index 2: "row-2"
Index 3: "row-3"
currSize = 4

From this point on, every index the writer assigns is off-by-one from what the reader expects. The next new entry the writer encodes goes at index 4, but the reader's index 4 is the duplicate "row-3". The reader will decode garbage or throw.

The same problem gets worse at capacity — extra entries cause different LRU evictions, so the writer and reader disagree on which entry gets evicted and which index gets reused.
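
The same desync as a toy snippet (a plain list stands in for the real LRUDictionary; illustration only):

    import java.util.ArrayList;
    import java.util.List;

    public class DictDesyncDemo {
      public static void main(String[] args) {
        List<String> readerDict = new ArrayList<>(List.of("row-0", "row-1", "row-2"));

        readerDict.add("row-3"); // partial read: added at index 3, then EOF hits
        readerDict.add("row-3"); // seek back and retry: duplicate added at index 4

        // The writer assigns its next new entry index 4, but the reader has:
        System.out.println(readerDict.get(4)); // prints the stale duplicate "row-3"
      }
    }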

@sidkhillon
Author

For reference, I have now opened an upstream PR here.

@hgromer
Collaborator

hgromer commented Feb 11, 2026

ah yup, thanks for writing out the example
