feat(import): add support for multiple hbase snapshot imports #4600
tianlei2 wants to merge 12 commits into googleapis:main
Conversation
vermas2012 left a comment:
Thanks for including tests for the config and utility classes! However, before we can merge this, we need unit test coverage for the core pipeline transformations—specifically ReadRegions, ListRegions, and HbaseRegionSplitTracker. Because ReadRegions handles complex dynamic splitting and large-cell filtering, we need to ensure that logic is verified. We should also add a standard coder test for RegionConfigCoder to prevent serialization bugs.
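For the coder test, Beam provides `CoderProperties.coderDecodeEncodeEqual` to verify that encoding a value and decoding it back yields an equal value. As a dependency-free sketch of the same round-trip property such a test asserts (using Java serialization as a stand-in for `RegionConfigCoder`, which is not shown here):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class RoundTripCheck {
  // Encode a value to bytes and decode it back. A coder test asserts
  // that the decoded value equals the original; Beam's
  // CoderProperties.coderDecodeEncodeEqual does this for a real Coder.
  public static <T extends Serializable> T roundTrip(T value) throws Exception {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(value); // encode
    }
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      @SuppressWarnings("unchecked")
      T decoded = (T) in.readObject(); // decode
      return decoded;
    }
  }
}
```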
The package is not empty, since it contains RegionConfigCoder. Maybe we can remove it if we move RegionConfigCoder somewhere else?
```java
try {
  cleanupSnapshot(snapshotConfig);
} catch (Exception ex) {
  LOG.error(
```
What is the implication of swallowing an exception here? Is it possible to retry the cleanup and make sure we don't leave a restored snapshot in GCS?
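One way to address this is a bounded retry that rethrows only after the last attempt, so a failed cleanup is surfaced instead of silently swallowed. A minimal sketch; the cleanup action and backoff policy are placeholders, not the PR's actual code:

```java
public class RetryingCleanup {
  /**
   * Runs a cleanup action up to maxAttempts times. Returns the attempt
   * number that succeeded; rethrows the last failure if all attempts fail,
   * so a leaked restored snapshot in GCS becomes visible in the job logs.
   */
  public static int runWithRetries(Runnable cleanup, int maxAttempts) {
    RuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        cleanup.run();
        return attempt; // cleanup succeeded on this attempt
      } catch (RuntimeException ex) {
        last = ex; // log the failure here; optionally sleep for backoff
      }
    }
    throw last; // surface the failure instead of swallowing it
  }
}
```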
This file seems too big: it has 3-4 objects, each doing a specific task. See if we can split it into 3-4 files, each with its own unit tests. I will leave another comment about testing in the PR.
```java
pipeline
    .apply("CreateInput", Create.of(SnapshotTestHelper.newSnapshotConfig("invalid_path")))
    .apply("DeleteSnapshot", ParDo.of(new CleanupRestoredSnapshots()));
pipeline.run();
```
Added a comment here: the pipeline should run successfully without throwing an exception (for CleanupRestoredSnapshots).
```diff
   .withTableId(opts.getBigtableTableId())
-  .withConfiguration(BigtableOptionsFactory.CUSTOM_USER_AGENT_KEY, customUserAgent);
+  .withConfiguration(BigtableOptionsFactory.CUSTOM_USER_AGENT_KEY, customUserAgent)
+  .withConfiguration(BigtableOptionsFactory.MAX_INFLIGHT_RPCS_KEY, "100")
```
This 100 (and the 30 ms later) is hardcoded with no way to change it. Are we sure this works for all imports? Otherwise, we should expose them as options that we can set from the job config.
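Exposing the values as overridable settings with the current numbers as defaults could look like the sketch below. The key names are hypothetical; in a Beam template these would typically become `PipelineOptions` getters with `@Default` annotations rather than a raw map lookup:

```java
import java.util.Map;

public class ImportTuningOptions {
  // Illustrative defaults mirroring the hardcoded values in the diff.
  // "maxInflightRpcs" and "rpcAttemptTimeoutMs" are made-up config keys.
  public static int maxInflightRpcs(Map<String, String> jobConfig) {
    return Integer.parseInt(jobConfig.getOrDefault("maxInflightRpcs", "100"));
  }

  public static long rpcAttemptTimeoutMs(Map<String, String> jobConfig) {
    return Long.parseLong(jobConfig.getOrDefault("rpcAttemptTimeoutMs", "30"));
  }
}
```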
```java
}

public static Configuration getHBaseConfiguration(Map<String, String> configurations) {
  if (hbaseConfiguration == null) {
```
This double-checked locking requires the hbaseConfiguration field to be volatile to work.
Note on the volatile keyword:
Adding volatile here is strictly required for thread safety. Without it, the JVM is allowed to reorder the object initialization and the reference assignment, which means another thread could pass the null check and receive a partially constructed object.
This was a well-documented flaw in early Java that was fixed in JSR-133 by enforcing a strict happens-before guarantee on volatile reads/writes.
- Reference: Baeldung: Double-Checked Locking with Singleton
- Official Sonar Rule: RSPEC-2168
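The corrected idiom looks like the sketch below. A plain `Map` stands in for Hadoop's `Configuration` so the example is self-contained; the field and method names follow the diff:

```java
import java.util.HashMap;
import java.util.Map;

public class HBaseConfigHolder {
  // volatile is what makes double-checked locking safe (JSR-133): the write
  // to this field happens-before any later read of it, so no thread can
  // observe a partially constructed object through the fast-path null check.
  private static volatile Map<String, String> hbaseConfiguration;

  public static Map<String, String> getHBaseConfiguration(Map<String, String> configurations) {
    Map<String, String> local = hbaseConfiguration; // one volatile read on the fast path
    if (local == null) {
      synchronized (HBaseConfigHolder.class) {
        local = hbaseConfiguration; // re-check under the lock
        if (local == null) {
          local = new HashMap<>(configurations);
          hbaseConfiguration = local; // safe publication via the volatile write
        }
      }
    }
    return local;
  }
}
```

Copying the field into the `local` variable also avoids a second volatile read on the common path.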
```java
  this.enableDynamicSplitting = enableDynamicSplitting;
}

public ByteKeyRange currentRestriction() {
```
Missing @Override; same for other methods like tryClaim.
What is the size of the snapshot being imported? How many shards are we setting? I don't see the shard configs, so maybe we are not setting the shards. We need to set them to make sure we are importing the whole snapshot via the multiple jobs.
Add a sharded test. The snapshot file size is 78 KB.
b/429250716
This is the first PR that incorporates changes from https://github.com/jhambleton/java-bigtable-hbase/commits/dataflow-v2-v2.15.6 and some fixes to make it pass the tests.
SnapshotUtilsTest.testGetHbaseConfiguration was failing because the static configuration field SnapshotUtils.hbaseConfiguration cached state between test cases, leaking stale data into subsequent tests.
SnapshotUtilsTest.testAppendCurrentTimestamp was throwing a NumberFormatException because the return value contained a UUID suffix (timestamp-UUID), but the test attempted to parse the entire string directly as a Long.
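The subtlety is that a UUID itself contains hyphens, so a name shaped like `timestamp-UUID` must be split on the first hyphen only before parsing the prefix. A minimal sketch of the fix (the class and method names here are illustrative, not the PR's actual code):

```java
public class RestorePathNames {
  // A generated name is "<timestamp>-<uuid>". Parsing the whole string as a
  // Long throws NumberFormatException; instead, take only the prefix before
  // the first '-' (UUIDs contain four more hyphens of their own).
  public static long extractTimestamp(String name) {
    int firstDash = name.indexOf('-');
    String prefix = (firstDash >= 0) ? name.substring(0, firstDash) : name;
    return Long.parseLong(prefix);
  }
}
```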
Integration tests failed on Java 8 and 11 in Kokoro because of unshaded transitive dependency conflicts (com.google.protobuf.LiteralByteString NoClassDefFoundError).
Several useful unit tests were commented out in ImportJobFromHbaseSnapshotTest because mockito-core lacked the ability to mock static methods.
Switched from mockito-core to mockito-inline in the pom.xml to allow static mocking.
Uncommented the code and restored the original formatting to prevent any lint errors, enabling JUnit to verify correct configuration parsing.
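The dependency swap enabling `Mockito.mockStatic(...)` in the tests is a one-line change in the pom.xml, roughly as below (the version shown is illustrative; mockito-inline is only needed up to Mockito 4.x, since inline mocking became the default in mockito-core 5):

```xml
<dependency>
  <groupId>org.mockito</groupId>
  <artifactId>mockito-inline</artifactId>
  <version>4.11.0</version>
  <scope>test</scope>
</dependency>
```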