Skip to content

HIVE-29536: Stabilize rebalance compaction tests#6487

Open
InvisibleProgrammer wants to merge 3 commits into
apache:masterfrom
InvisibleProgrammer:fix_rebalance_tests_flakyness
Open

HIVE-29536: Stabilize rebalance compaction tests#6487
InvisibleProgrammer wants to merge 3 commits into
apache:masterfrom
InvisibleProgrammer:fix_rebalance_tests_flakyness

Conversation

@InvisibleProgrammer
Copy link
Copy Markdown
Contributor

Rebalance tests are sensitive and the hard-coded assertions need to be modified regularly.
Some examples:

There are two causes identified:

What changes were proposed in this pull request?

The goal of the change is to stabilize those tests by doing two things:

  • Rebalance assertions are not hard-coded. Instead of that, we can check if the buckets are balanced or not and if all the data is available.
  • Base folder can be searched dinamically

Please note: I also refactored the code little bit and extracted rebalance compaction tests into a new class.

Why are the changes needed?

We experienced regular and serious regression issues due to the effect of the orc version number.

Does this PR introduce any user-facing change?

No

How was this patch tested?

With the existing tests.

Copy link
Copy Markdown
Contributor

@thomasrebele thomasrebele left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on the fix! I've added some suggestions and requests for improving it.

*/

int optimalRecordsInBucket = expectedData.size() / bucketCount;
int maximumRecordCountInABucket = optimalRecordsInBucket + bucketCount - 1;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a different formula is needed here. For example, if we have 11 rows and 10 buckets, the optimal records in bucket would be 11/10 = 1. Then maximumRecordCountInABucket = 1 + 10 - 1 = 10. If one bucket contains 10 rows, then another bucket contains 1 row, and the remaining 8 buckets are empty. I would not call that distribution "balanced".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes totally sense. Let me think about that. What about maximumRecordCountInABucket = 2 * optimalRecordsInBucket - 1?

It wouldn't be totally accurate as your edge case is about what if the optimal bucket count is exactly 1. We have no test case for that and it more a theoretical than a practical question: for buckets with optimal element size 1, we should allow having buckets with 2 elements. But I cannot imagine a practical use case when it actually happens. For every other bucket size, 2 x bucketsize - 1 sounds fine: it allows 'bucket overflow', so when the row_id division happens, we have space to put the remainders. But it also says if we have 2x bucketsize number of elements, there is no way the table is balanced.

Does it makes sense?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you meant "if the optimal row count is exactly 1". I just used these numbers to make my point clear. If we distribute the 11 rows to 5 buckets, the optimalRecordsInBucket would be 2. Then maximumRecordCountInABucket = 2 + 5 - 1 = 6. So we could get a distribution of 6 rows in the first bucket, 5 rows in the second, and the remaining 3 buckets are empty.

As far as I understood the rebalancing, we want to redistribute the $n :=$ allRecordCount rows to the $m :=$ bucketCount buckets, so that the number of rows of two buckets differs by at most 1. This is slightly related to the Pigeonhole principle.

If we distribute $n$ rows to $m$ buckets under the above condition, the buckets have at least $\Big\lfloor \frac{n}{m} \Big\rfloor$ rows and at most $\Big\lceil \frac{n}{m} \Big\rceil$ rows. Using an integer division /, we can rewrite $\Big\lceil \frac{n}{m} \Big\rceil$ as

int maximumRecordCountInABucket = (allRecordCount + bucketCount - 1) / bucketCount;

Afaik, this is the exact upper limit to expect if the buckets are balanced. I think this formula should be used here.

.reduce(0, Integer::sum);

int optimalRecordsInBucket = allRecordCount / bucketCount;
int maximumRecordCountInABucket = optimalRecordsInBucket + bucketCount - 1;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestDataProvider testDataProvider = prepareRebalanceTestData(tableName);

//Try to do a rebalancing compaction
executeStatementOnDriver("ALTER TABLE " + tableName + " COMPACT 'rebalance'", driver);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment that without explicit ORDER BY, there's an implicit order defined by the rebalance compaction query that fixes the order?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a test case. I wonder how it helps understanding rebalance compaction or the test case itself.

It is an information that I would really like as part of a full description of the feature, in the class that responsible for doing the feature and in the official documentation.

What do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would help understand why we assert the exact order of the rows is verified in this test case. I had wondered a short time, until I realized that the order is fixed.


/*
check if the test data is unbalanced
balanced if all the buckets contains between n / bucket count and n / bucket count + bucket count rows
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment https://github.com/apache/hive/pull/6487/changes#r3248007538.

Nit: contain (without s)


// Assert that we have multiple buckets
List<String> bucketFilenames = CompactorTestUtil.getBucketFileNames(fs, table, null, "base_0000001");
assertTrue(bucketFilenames.size() > 1);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assert should be removed (we're checking for == 1 later).

Assert.assertFalse(isBalanced(tableName, testDataProvider));

// Please note, as the test tests rebalance compaction, not insert overwrite, it is not necessary to test if
// we have the exact same data after preparing the test data as we had at the source table.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would still be helpful to check the data here, in order to catch problems from other parts of Hive early. Maybe simply check the number of records in the table.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Let me add a check on the number of records. Usually, I'm against doing assertions at the arrange part of the tests but as I saw, those tests are already full with them so I don't think if it can cause a misunderstanding.

Assert.assertEquals(errorMessage, e.getCauseMessage());
Assert.assertEquals(ErrorMsg.COMPACTION_REFUSED.getErrorCode(), e.getErrorCode());
assertEquals(errorMessage, e.getCauseMessage());
assertEquals(ErrorMsg.COMPACTION_REFUSED.getErrorCode(), e.getErrorCode());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to do refactors in a separate ticket/commit. The official guideline says "reformat code unrelated to the issue being fixed: formatting changes should be separate patches/commits;". While this is not strictly a formatting change, it feels quite close to it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Errr... Sure. Let me check this. I suppose I have to reduce my IDE's capabilities and it can be done. Honestly, I just removed some code blocks from this class. Those changes were handled by the IDE and I didn't even notice them.

However, I think the official guideline is a little bit obsolete: I'm pretty sure I saw a discussion somewhere about the opposite, like please do not do trivial changes like small reformats, formatting, etc to reduce the load on the precommit jobs.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, reducing the workload on the CI is indeed a valid concern. Is it possible to merge this PR with a rebase, instead of a squash? That way the import related changes could be move to a separate commit in the same PR. I wished Github had an interactive rebase feature for merging PRs.

I had a quick search, but couldn't find the discussion. Do you have a link by chance?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to find it but couldn't :(

Comment on lines +552 to +553
assertTrue(expectedData.contains(rowData));
expectedData.remove(rowData);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid looking up the element twice, the return value of Set#remove() could be used.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assertion right before this line does the exact same check.

Copy link
Copy Markdown
Contributor

@thomasrebele thomasrebele May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I wanted to say:
With

assertTrue(expectedData.remove(rowData));

we could avoid the call to expectedData.contains(rowData). It's a bit less readable, so I'll leave it up to you to decide which one to take.

TestDataProvider testDataProvider = prepareRebalanceTestData(tableName);

//Try to do a rebalancing compaction
executeStatementOnDriver("ALTER TABLE " + tableName + " COMPACT 'rebalance'", driver);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would help understand why we assert the exact order of the rows is verified in this test case. I had wondered a short time, until I realized that the order is fixed.

*/

int optimalRecordsInBucket = expectedData.size() / bucketCount;
int maximumRecordCountInABucket = optimalRecordsInBucket + bucketCount - 1;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you meant "if the optimal row count is exactly 1". I just used these numbers to make my point clear. If we distribute the 11 rows to 5 buckets, the optimalRecordsInBucket would be 2. Then maximumRecordCountInABucket = 2 + 5 - 1 = 6. So we could get a distribution of 6 rows in the first bucket, 5 rows in the second, and the remaining 3 buckets are empty.

As far as I understood the rebalancing, we want to redistribute the $n :=$ allRecordCount rows to the $m :=$ bucketCount buckets, so that the number of rows of two buckets differs by at most 1. This is slightly related to the Pigeonhole principle.

If we distribute $n$ rows to $m$ buckets under the above condition, the buckets have at least $\Big\lfloor \frac{n}{m} \Big\rfloor$ rows and at most $\Big\lceil \frac{n}{m} \Big\rceil$ rows. Using an integer division /, we can rewrite $\Big\lceil \frac{n}{m} \Big\rceil$ as

int maximumRecordCountInABucket = (allRecordCount + bucketCount - 1) / bucketCount;

Afaik, this is the exact upper limit to expect if the buckets are balanced. I think this formula should be used here.

@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants