HIVE-29536: Stabilize rebalance compaction tests by InvisibleProgrammer · Pull Request #6487 · apache/hive

InvisibleProgrammer · 2026-05-15T07:34:17Z

Rebalance tests are sensitive and the hard-coded assertions need to be modified regularly.
Some examples:

There are two causes identified:

Firstly, the number of buckets and even the order of the elements inside a bucket depend on the version string of ORC: https://issues.apache.org/jira/browse/HIVE-29536?focusedCommentId=18080335&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-18080335 (Thanks @thomasrebele for digging into it)
Secondly, the base directory can change as well (like here: 1a90d27#diff-dedd154465fd42855d9d6710d54553660dae87405ce2e4ea931475de1d5bb816L199)

What changes were proposed in this pull request?

The goal of the change is to stabilize those tests by doing two things:

Rebalance assertions are no longer hard-coded. Instead, the tests check whether the buckets are balanced and whether all the data is present.
The base folder is located dynamically.
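
To illustrate the second point, here is a minimal sketch of how a base directory could be picked without hard-coding a name like `base_0000001`. It is not the code from this PR: the class `BaseDirFinder` and its methods are hypothetical, it works on plain directory names rather than on a Hadoop `FileSystem`, and it assumes base directories are named `base_<writeId>` with an optional `_v<id>` suffix.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Hive: selects the newest base directory by name.
final class BaseDirFinder {
  private static final String BASE_PREFIX = "base_";

  // Parses the write id from names such as "base_0000001" or "base_0000007_v0000015".
  // Assumption: the digits directly after the prefix are the write id.
  static long writeIdOf(String dirName) {
    String rest = dirName.substring(BASE_PREFIX.length());
    int suffixStart = rest.indexOf('_');
    String digits = suffixStart < 0 ? rest : rest.substring(0, suffixStart);
    return Long.parseLong(digits);
  }

  // Returns the base directory with the highest write id, or empty if there is none.
  static Optional<String> findLatestBase(List<String> dirNames) {
    return dirNames.stream()
        .filter(name -> name.startsWith(BASE_PREFIX))
        .max(Comparator.comparingLong(BaseDirFinder::writeIdOf));
  }

  public static void main(String[] args) {
    List<String> dirs =
        List.of("delta_0000001_0000001_0000", "base_0000002", "base_0000007_v0000015");
    System.out.println(findLatestBase(dirs).orElse("<no base directory>"));
  }
}
```

A test could then pass the returned name to whatever utility lists the bucket files, so the assertion no longer breaks when the write id of the base directory changes.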

Please note: I also refactored the code a little bit and extracted the rebalance compaction tests into a new class.

Why are the changes needed?

We experienced regular and serious regressions caused by the ORC version number.

Does this PR introduce any user-facing change?

No

How was this patch tested?

With the existing tests.

thomasrebele

Thank you for working on the fix! I've added some suggestions and requests for improving it.

thomasrebele · 2026-05-15T11:57:53Z

+     */
+
+    int optimalRecordsInBucket = expectedData.size() / bucketCount;
+    int maximumRecordCountInABucket = optimalRecordsInBucket + bucketCount - 1;


I think a different formula is needed here. For example, if we have 11 rows and 10 buckets, the optimal records in bucket would be 11/10 = 1. Then maximumRecordCountInABucket = 1 + 10 - 1 = 10. If one bucket contains 10 rows, then another bucket contains 1 row, and the remaining 8 buckets are empty. I would not call that distribution "balanced".

It makes total sense. Let me think about that. What about maximumRecordCountInABucket = 2 * optimalRecordsInBucket - 1?

It wouldn't be totally accurate, as your edge case is about what happens if the optimal bucket count is exactly 1. We have no test case for that, and it is more a theoretical than a practical question: for buckets with an optimal element count of 1, we should allow buckets with 2 elements. But I cannot imagine a practical use case where that actually happens. For every other bucket size, 2 x bucketsize - 1 sounds fine: it allows 'bucket overflow', so when the row_id division happens, we have space to put the remainders. But it also says that if a bucket holds 2 x bucketsize elements, there is no way the table is balanced.

Does it make sense?

I think you meant "if the optimal row count is exactly 1". I just used these numbers to make my point clear. If we distribute the 11 rows to 5 buckets, the optimalRecordsInBucket would be 2. Then maximumRecordCountInABucket = 2 + 5 - 1 = 6. So we could get a distribution of 6 rows in the first bucket, 5 rows in the second, and the remaining 3 buckets are empty.

As far as I understood the rebalancing, we want to redistribute the $n :=$ allRecordCount rows to the $m :=$ bucketCount buckets, so that the number of rows of two buckets differs by at most 1. This is slightly related to the Pigeonhole principle.

If we distribute $n$ rows to $m$ buckets under the above condition, the buckets have at least $\Big\lfloor \frac{n}{m} \Big\rfloor$ rows and at most $\Big\lceil \frac{n}{m} \Big\rceil$ rows. Using an integer division /, we can rewrite $\Big\lceil \frac{n}{m} \Big\rceil$ as

int maximumRecordCountInABucket = (allRecordCount + bucketCount - 1) / bucketCount;

Afaik, this is the exact upper limit to expect if the buckets are balanced. I think this formula should be used here.
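
The bound above can be tried out in isolation. The following is a sketch only, assuming the per-bucket row counts are already available as a list; the class `BucketBalanceCheck` and its method names are made up for illustration and are not the helper used in the PR.

```java
import java.util.List;

// Hypothetical helper: checks whether row counts differ by at most 1 across buckets.
final class BucketBalanceCheck {

  // floor(n / m): the fewest rows a bucket may hold in a balanced table.
  static int minRowsPerBucket(int allRecordCount, int bucketCount) {
    return allRecordCount / bucketCount;
  }

  // ceil(n / m), written with integer division only.
  static int maxRowsPerBucket(int allRecordCount, int bucketCount) {
    return (allRecordCount + bucketCount - 1) / bucketCount;
  }

  // Balanced means every bucket holds between floor(n / m) and ceil(n / m) rows.
  static boolean isBalanced(List<Integer> rowsPerBucket) {
    int bucketCount = rowsPerBucket.size();
    if (bucketCount == 0) {
      throw new IllegalArgumentException("at least one bucket is required");
    }
    int allRecordCount = rowsPerBucket.stream().mapToInt(Integer::intValue).sum();
    int min = minRowsPerBucket(allRecordCount, bucketCount);
    int max = maxRowsPerBucket(allRecordCount, bucketCount);
    return rowsPerBucket.stream().allMatch(count -> count >= min && count <= max);
  }

  public static void main(String[] args) {
    // The skewed distribution from the 11 rows / 5 buckets example.
    System.out.println(isBalanced(List.of(6, 5, 0, 0, 0)));
    // An even distribution of the same 11 rows.
    System.out.println(isBalanced(List.of(3, 2, 2, 2, 2)));
  }
}
```

With 11 rows and 5 buckets the limits are 2 and 3, so the 6/5/0/0/0 distribution that the old bound of 6 would accept is rejected. The sketch assumes non-negative counts and ignores integer overflow, which should not matter at test-data sizes.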

thomasrebele · 2026-05-15T11:58:56Z

+        .reduce(0, Integer::sum);
+
+    int optimalRecordsInBucket = allRecordCount / bucketCount;
+    int maximumRecordCountInABucket = optimalRecordsInBucket + bucketCount - 1;


See comment https://github.com/apache/hive/pull/6487/changes#r3248007538

thomasrebele · 2026-05-15T12:06:44Z

+    TestDataProvider testDataProvider = prepareRebalanceTestData(tableName);
+
+    //Try to do a rebalancing compaction
+    executeStatementOnDriver("ALTER TABLE " + tableName + " COMPACT 'rebalance'", driver);


Maybe add a comment that without explicit ORDER BY, there's an implicit order defined by the rebalance compaction query that fixes the order?

It is a test case. I wonder how it helps understanding rebalance compaction or the test case itself.

That is information I would really like to see as part of a full description of the feature, in the class that is responsible for implementing it, and in the official documentation.

What do you think?

It would help readers understand why the exact order of the rows is asserted in this test case. I wondered about it for a short time, until I realized that the order is fixed.

thomasrebele · 2026-05-15T12:26:40Z

+
+    /*
+     check if the test data is unbalanced
+     balanced if all the buckets contains between n / bucket count and n / bucket count + bucket count rows


See comment https://github.com/apache/hive/pull/6487/changes#r3248007538.

Nit: contain (without s)

thomasrebele · 2026-05-15T12:29:53Z

+
+    // Assert that we have multiple buckets
+    List<String> bucketFilenames = CompactorTestUtil.getBucketFileNames(fs, table, null, "base_0000001");
+    assertTrue(bucketFilenames.size() > 1);


This assert should be removed (we're checking for == 1 later).

thomasrebele · 2026-05-15T12:39:05Z

+    Assert.assertFalse(isBalanced(tableName, testDataProvider));
+
+    // Please note, as the test tests rebalance compaction, not insert overwrite, it is not necessary to test if
+    // we have the exact same data after preparing the test data as we had at the source table.


I think it would still be helpful to check the data here, in order to catch problems from other parts of Hive early. Maybe simply check the number of records in the table.

Sure. Let me add a check on the number of records. Usually, I'm against doing assertions in the arrange part of a test, but as far as I can see, those tests are already full of them, so I don't think it can cause a misunderstanding.

thomasrebele · 2026-05-15T12:57:32Z

-      Assert.assertEquals(errorMessage, e.getCauseMessage());
-      Assert.assertEquals(ErrorMsg.COMPACTION_REFUSED.getErrorCode(), e.getErrorCode());
+      assertEquals(errorMessage, e.getCauseMessage());
+      assertEquals(ErrorMsg.COMPACTION_REFUSED.getErrorCode(), e.getErrorCode());


I would prefer to do refactors in a separate ticket/commit. The official guideline says "reformat code unrelated to the issue being fixed: formatting changes should be separate patches/commits;". While this is not strictly a formatting change, it feels quite close to it.

Errr... Sure. Let me check this. I suppose I have to reduce my IDE's capabilities and it can be done. Honestly, I just removed some code blocks from this class. Those changes were handled by the IDE and I didn't even notice them.

However, I think the official guideline is a little bit obsolete: I'm pretty sure I saw a discussion somewhere about the opposite, like please do not do trivial changes like small reformats, formatting, etc to reduce the load on the precommit jobs.

Hm, reducing the workload on the CI is indeed a valid concern. Is it possible to merge this PR with a rebase, instead of a squash? That way the import-related changes could be moved to a separate commit in the same PR. I wish GitHub had an interactive rebase feature for merging PRs.

I had a quick search, but couldn't find the discussion. Do you have a link by chance?

I tried to find it but couldn't :(

thomasrebele · 2026-05-15T13:05:48Z

+        assertTrue(expectedData.contains(rowData));
+        expectedData.remove(rowData);


To avoid looking up the element twice, the return value of Set#remove() could be used.

The assertion right before this line does the exact same check.

What I wanted to say:
With

assertTrue(expectedData.remove(rowData));

we could avoid the call to expectedData.contains(rowData). It's a bit less readable, so I'll leave it up to you to decide which one to take.
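
For illustration, the single-lookup variant could look like the sketch below. It assumes rows are compared as strings; `ExpectedRowsCheck` and `matchesExactly` are hypothetical names, not code from the PR.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: verifies that the actual rows are exactly the expected ones.
final class ExpectedRowsCheck {

  // Returns true only if every actual row was expected and no expected row is left over.
  static boolean matchesExactly(Set<String> expected, List<String> actualRows) {
    // Work on a copy so the caller's set is not modified.
    Set<String> remaining = new HashSet<>(expected);
    for (String row : actualRows) {
      // Set#remove returns false if the row was not present,
      // so it covers both the membership check and the removal.
      if (!remaining.remove(row)) {
        return false; // unexpected row, or a row that appeared twice
      }
    }
    return remaining.isEmpty();
  }

  public static void main(String[] args) {
    Set<String> expected = Set.of("1\tfoo", "2\tbar");
    System.out.println(matchesExactly(expected, List.of("2\tbar", "1\tfoo")));
  }
}
```

Because a removed row cannot be removed a second time, the same call also catches duplicated rows, which a `contains` check against an unmodified set would miss.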

sonarqubecloud · 2026-05-21T09:33:30Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.7% Duplication on New Code

See analysis details on SonarQube Cloud

HIVE-29536: Stabilize rebalance compaction tests

36bcc44

asf-ci-hive added tests pending tests passed and removed tests pending labels May 15, 2026

thomasrebele suggested changes May 15, 2026

Address review comments

1ddc65d

asf-ci-hive added tests pending tests passed and removed tests passed tests pending labels May 20, 2026

Address SonarQube issues

13b6dba

asf-ci-hive added tests pending and removed tests passed labels May 21, 2026

thomasrebele suggested changes May 21, 2026

asf-ci-hive added tests passed and removed tests pending labels May 21, 2026
