feat(import): add support for multiple hbase snapshot imports #4600

Open
tianlei2 wants to merge 12 commits into googleapis:main from tianlei2:dataflow-import

Conversation

@tianlei2 tianlei2 commented Apr 28, 2026

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

b/429250716

This is the first PR that incorporates changes from https://github.com/jhambleton/java-bigtable-hbase/commits/dataflow-v2-v2.15.6 and some fixes to make it pass the tests.

  • Fixed Test Isolation Issues
    SnapshotUtilsTest.testGetHbaseConfiguration was failing because the static configuration field SnapshotUtils.hbaseConfiguration cached state between test cases, leaking stale data into subsequent tests.
    • Solution: Added a @Before setup method to reset the static field to null via reflection before every test run.
  • Fixed Timestamp Formatting Tests
    SnapshotUtilsTest.testAppendCurrentTimestamp was throwing a NumberFormatException because the return value contained a UUID suffix (timestamp-UUID), but the test attempted to parse the entire string directly as a Long.
    • Solution: Updated the test to split the string using the "-" character to extract and correctly parse just the timestamp prefix.
  • Resolved Classpath and SPI Conflicts (dnsjava)
    Integration tests failed on Java 8 and 11 in Kokoro because of unshaded transitive dependency conflicts (com.google.protobuf.LiteralByteString NoClassDefFoundError).
    • Solution: Reverted to the shaded hbase-shaded-mapreduce dependency, ensuring compatibility across all Java versions.
  • Uncommented and Fixed Tests in ImportJobFromHbaseSnapshotTest
    Several useful unit tests were commented out in ImportJobFromHbaseSnapshotTest because mockito-core lacked the ability to mock static methods.
    • Solution:
      Switched from mockito-core to mockito-inline in the pom.xml to allow static mocking.
      Uncommented the code and restored the original formatting to prevent any lint errors, enabling JUnit to verify correct configuration parsing.
  • ComputeAndValidateHashFromBigtableDoFnTest.java was accidentally deleted; added it back
  • Cleaned up unused comments
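
The reflection-based reset described above can be sketched with plain JDK code. `SnapshotUtilsStub` and its field are hypothetical stand-ins for the real `SnapshotUtils.hbaseConfiguration`; in the PR this would live in a JUnit `@Before` method:

```java
import java.lang.reflect.Field;

// Hypothetical stand-in for SnapshotUtils and its cached static configuration.
class SnapshotUtilsStub {
  static String hbaseConfiguration = "stale-from-previous-test";
}

public class ResetStaticFieldExample {
  // Mirrors the @Before setup: reset the cached static field via reflection
  // so every test starts from a clean state.
  static void resetStaticField() throws Exception {
    Field field = SnapshotUtilsStub.class.getDeclaredField("hbaseConfiguration");
    field.setAccessible(true);
    field.set(null, null); // target is null because the field is static
  }

  public static void main(String[] args) throws Exception {
    resetStaticField();
    System.out.println(SnapshotUtilsStub.hbaseConfiguration); // prints "null"
  }
}
```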
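The timestamp-parsing fix can likewise be illustrated standalone; the `timestamp-UUID` shape is taken from the description above, and the split-and-parse is the gist of the test change:

```java
import java.util.UUID;

public class TimestampParseExample {
  public static void main(String[] args) {
    // A value shaped like the one described above: "<timestamp>-<UUID>".
    String value = System.currentTimeMillis() + "-" + UUID.randomUUID();

    // Parsing the whole string would throw NumberFormatException because of the
    // UUID suffix; splitting on "-" and taking the first token recovers the
    // timestamp prefix.
    long timestamp = Long.parseLong(value.split("-")[0]);
    System.out.println(timestamp > 0); // prints "true"
  }
}
```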

@tianlei2 tianlei2 requested a review from a team as a code owner April 28, 2026 18:41
@product-auto-label product-auto-label Bot added size: xl Pull request size is extra large. api: bigtable Issues related to the googleapis/java-bigtable-hbase API. labels Apr 28, 2026
@tianlei2 tianlei2 force-pushed the dataflow-import branch 3 times, most recently from f5324bd to f8b5932 Compare April 28, 2026 19:22
@tianlei2 tianlei2 changed the title Dataflow import feat(import): add support for multiple hbase snapshot imports Apr 28, 2026
@tianlei2 tianlei2 force-pushed the dataflow-import branch 3 times, most recently from 13f65dc to a3a6c7f Compare April 28, 2026 20:17
@tianlei2 tianlei2 marked this pull request as draft April 28, 2026 20:42
@tianlei2 tianlei2 marked this pull request as ready for review April 28, 2026 22:41
@tianlei2 tianlei2 marked this pull request as draft April 28, 2026 22:42
@tianlei2 tianlei2 added the kokoro:run Add this label to force Kokoro to re-run the tests. label Apr 29, 2026
@tianlei2 tianlei2 self-assigned this Apr 29, 2026
@yoshi-kokoro yoshi-kokoro removed the kokoro:run Add this label to force Kokoro to re-run the tests. label Apr 29, 2026
@tianlei2 tianlei2 force-pushed the dataflow-import branch 4 times, most recently from 12f1c11 to d511a61 Compare April 29, 2026 19:46
@tianlei2 tianlei2 requested a review from vermas2012 April 29, 2026 20:27
@googleapis googleapis deleted a comment from google-cla Bot Apr 29, 2026
@tianlei2 tianlei2 force-pushed the dataflow-import branch from 86d5318 to f1125d8 Compare May 6, 2026 16:46
@tianlei2 tianlei2 added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label May 6, 2026
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label May 6, 2026
@tianlei2 tianlei2 added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label May 6, 2026
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label May 6, 2026
Member

@vermas2012 vermas2012 left a comment

Thanks for including tests for the config and utility classes! However, before we can merge this, we need unit test coverage for the core pipeline transformations—specifically ReadRegions, ListRegions, and HbaseRegionSplitTracker. Because ReadRegions handles complex dynamic splitting and large-cell filtering, we need to ensure that logic is verified. We should also add a standard coder test for RegionConfigCoder to prevent serialization bugs.

Member

Is this needed?

Author

The package is not empty, since it contains RegionConfigCoder. Maybe we can remove it if we move RegionConfigCoder somewhere else?

try {
  cleanupSnapshot(snapshotConfig);
} catch (Exception ex) {
  LOG.error(
Member

What is the implication of swallowing an exception here? Is it possible to retry the cleanup and make sure we don't leave a restored snapshot in GCS?
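
A minimal, dependency-free sketch of the retry idea this comment raises; all names here are hypothetical illustrations, not the PR's actual code:

```java
public class RetryCleanupExample {
  // Hypothetical sketch: retry a cleanup action a few times before giving up,
  // so a transient failure doesn't silently leave a restored snapshot behind.
  static boolean cleanupWithRetries(Runnable cleanup, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        cleanup.run();
        return true; // cleanup succeeded
      } catch (RuntimeException ex) {
        System.out.println("cleanup attempt " + attempt + " failed: " + ex.getMessage());
      }
    }
    return false; // caller can surface this, e.g. via a metric or a final log
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Fails twice, then succeeds — simulating a transient storage error.
    boolean ok = cleanupWithRetries(() -> {
      if (++calls[0] < 3) throw new RuntimeException("transient");
    }, 5);
    System.out.println(ok); // prints "true"
  }
}
```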

Member

This file seems too big: it has 3-4 objects, each doing a specific task. See if we can split it into 3-4 files, each with its own unit tests. I will leave another comment about testing in the PR.

pipeline
    .apply("CreateInput", Create.of(SnapshotTestHelper.newSnapshotConfig("invalid_path")))
    .apply("DeleteSnapshot", ParDo.of(new CleanupRestoredSnapshots()));
pipeline.run();
Member

What are we asserting here?

Author

Added a comment here. The pipeline should run successfully without throwing an exception (for CleanupRestoredSnapshots).

.withTableId(opts.getBigtableTableId())
.withConfiguration(BigtableOptionsFactory.CUSTOM_USER_AGENT_KEY, customUserAgent);
.withConfiguration(BigtableOptionsFactory.CUSTOM_USER_AGENT_KEY, customUserAgent)
.withConfiguration(BigtableOptionsFactory.MAX_INFLIGHT_RPCS_KEY, "100")
Member

This 100 and the 30 ms later are hardcoded with no way to change them. Are we sure this works for all imports? Otherwise, we should expose them as options that can be set from the job config.

}

public static Configuration getHBaseConfiguration(Map<String, String> configurations) {
  if (hbaseConfiguration == null) {
Member

This double-checked locking requires the hbaseConfiguration field to be volatile to work correctly.

Note on the volatile keyword:
Adding volatile here is strictly required for thread safety. Without it, the Java compiler is allowed to reorder the object initialization and the reference assignment. This means another thread could bypass the null check and receive a partially constructed object.

This was a well-documented flaw in early Java that was fixed in JSR-133 by enforcing a strict happens-before guarantee on volatile reads/writes.
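
The idiom this comment describes can be shown with a self-contained example (a generic singleton, not the actual SnapshotUtils code):

```java
public class DoubleCheckedLockingExample {
  // volatile is required: it forbids reordering the object's construction with
  // the reference assignment, so no thread can observe a partially constructed
  // instance through the unlocked first check.
  private static volatile DoubleCheckedLockingExample instance;

  private DoubleCheckedLockingExample() {}

  public static DoubleCheckedLockingExample getInstance() {
    if (instance == null) {                            // first check, no lock
      synchronized (DoubleCheckedLockingExample.class) {
        if (instance == null) {                        // second check, under lock
          instance = new DoubleCheckedLockingExample();
        }
      }
    }
    return instance;
  }

  public static void main(String[] args) {
    System.out.println(getInstance() == getInstance()); // prints "true"
  }
}
```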

Author

done

this.enableDynamicSplitting = enableDynamicSplitting;
}

public ByteKeyRange currentRestriction() {
Member

Missing @Override; same for other methods like tryClaim.

Author

done

Member

What is the size of the snapshot being imported? How many shards are we setting? I don't see the shard configs, so maybe we are not setting the shards. We need to set them to make sure we are importing the whole snapshot via the multiple jobs.

Author

Added a sharded test. The snapshot file size is 78 KB.

@tianlei2 tianlei2 force-pushed the dataflow-import branch 6 times, most recently from 738ee25 to 682753c Compare May 7, 2026 15:26
@tianlei2 tianlei2 force-pushed the dataflow-import branch 2 times, most recently from 4bd17e1 to b49307c Compare May 7, 2026 16:04
@tianlei2 tianlei2 force-pushed the dataflow-import branch 2 times, most recently from f84da91 to caace0b Compare May 7, 2026 16:58
@tianlei2 tianlei2 force-pushed the dataflow-import branch from 4e91349 to 5134e4b Compare May 7, 2026 17:30
@tianlei2 tianlei2 force-pushed the dataflow-import branch from 5134e4b to 808dbac Compare May 7, 2026 18:44
