WIP - Preserve duplicate keys for scan resumptions #5957
base: 2.1
Conversation
There may be two scenarios where this will not work as expected.
First, when reading duplicate keys from multiple files, they may not always come back in the same order. So if 100 duplicate keys are spread across 3 files and 50 are read, then when coming back and skipping 50, those may not be the same 50 values. The keys from multiple files are placed in a priority queue, which has no guaranteed order for items with the same priority. If files are added or removed between reading scan batches, this would probably greatly increase the chance of duplicate keys having a different scan order.
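As a rough, self-contained illustration of that first scenario (this uses a plain java.util.PriorityQueue, not Accumulo's actual multi-iterator heap), entries that compare as equal have no guaranteed drain order:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class EqualPriorityOrder {
  record Entry(String key, String sourceFile) {}

  public static void main(String[] args) {
    // All three entries have the same key, so the comparator sees them as equal.
    PriorityQueue<Entry> heap = new PriorityQueue<>(Comparator.comparing(Entry::key));
    heap.add(new Entry("row1 cf:cq", "fileA"));
    heap.add(new Entry("row1 cf:cq", "fileB"));
    heap.add(new Entry("row1 cf:cq", "fileC"));

    // The order these come back in is unspecified for equal elements; it depends on
    // insertion order and heap internals, so "skip the first N of them" is not a
    // stable way to resume a scan in the middle of a duplicate-key run.
    while (!heap.isEmpty()) {
      System.out.println(heap.poll());
    }
  }
}
```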
Second, this change stores state on the server side. If a scan reads 50 of 100 duplicate keys from tserver A and then tserver A fails, the scan will skip the other 50 when it resumes on another tserver.
One possible solution to this is to change server-side scan batching such that it never ends a batch on a duplicate key, or the scan fails if it would have to do that. So maybe we let the batch grow up to 2x or 3x the batch size (this could be configurable via a new setting) when duplicate keys are present, and if it still fills up then the scan fails with an error. This way the scan either gets all the duplicate keys or it fails, instead of sometimes getting incorrect data for this case.
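A minimal sketch of that idea, assuming a simple list of keys and made-up names (readBatch, maxBatchSize, maxDuplicateMultiplier); this is not the actual tserver batching code:

```java
import java.util.ArrayList;
import java.util.List;

public class DuplicateAwareBatching {

  // Fill a batch, but refuse to end it between two entries that share a key.
  // The batch may grow up to maxBatchSize * maxDuplicateMultiplier; if the
  // duplicate run is still not finished at that point, fail instead of
  // returning a partial run that a resumed scan could silently skip.
  static List<String> readBatch(List<String> keys, int start, int maxBatchSize,
      int maxDuplicateMultiplier) {
    int hardLimit = maxBatchSize * maxDuplicateMultiplier;
    List<String> batch = new ArrayList<>();
    int i = start;
    while (i < keys.size()) {
      batch.add(keys.get(i));
      i++;
      boolean batchFull = batch.size() >= maxBatchSize;
      boolean nextIsDuplicate = i < keys.size() && keys.get(i).equals(keys.get(i - 1));
      if (batchFull && !nextIsDuplicate) {
        break; // safe boundary: the next key differs, so resuming after it is unambiguous
      }
      if (batchFull && batch.size() >= hardLimit) {
        throw new IllegalStateException(
            "Duplicate run for key " + keys.get(i - 1) + " exceeded " + hardLimit + " entries");
      }
    }
    return batch;
  }

  public static void main(String[] args) {
    List<String> keys = List.of("a", "b", "b", "b", "c");
    // With maxBatchSize = 2 the batch grows to 4 so it does not end mid-run of "b".
    System.out.println(readBatch(keys, 0, 2, 3)); // [a, b, b, b]
  }
}
```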
Another possible solution is to sort data on the value when everything in the keys is equal. Then scans could continue based on key+value instead of just key. However, persisted data is not currently sorted this way. This would be a really big change to code, data, and user creation of rfiles. It offers the advantage of avoiding having to buffer all duplicate keys on the server side.
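To illustrate what resuming on key+value would mean, a small sketch using plain string pairs rather than Accumulo's Key/Value classes, so equal keys still have a total order:

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

public class KeyThenValueOrder {
  // Compare by key first, then by value, so two entries with identical keys are
  // still distinguishable and a resumed scan could seek to "just after (key, value)".
  static int compareKeyThenValue(Map.Entry<String, String> a, Map.Entry<String, String> b) {
    int c = a.getKey().compareTo(b.getKey());
    return c != 0 ? c : a.getValue().compareTo(b.getValue());
  }

  public static void main(String[] args) {
    var a = new SimpleImmutableEntry<>("row1 cf:cq", "valueA");
    var b = new SimpleImmutableEntry<>("row1 cf:cq", "valueB");
    // Key-only ordering would call these equal; key+value ordering puts a before b.
    System.out.println(compareKeyThenValue(a, b) < 0); // true
  }
}
```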
Another thing to consider is that not all keys on the server side are MemKeys. Persisted data are not MemKeys. Only recently written data stored in in-memory maps are MemKeys. The purpose of the count on MemKeys is to avoid reading data written after a scan started.
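For readers unfamiliar with MemKey, here is a heavily simplified sketch of the idea (not the real class; the field names and the comparison direction are assumptions for illustration only):

```java
// Sketch: entries in the in-memory map carry an insertion counter (kvCount).
// A scan records the counter value when it starts and ignores entries written
// later, so data inserted after the scan began is not returned. Entries read
// from persisted rfiles have no such counter.
public class MemKeySketch {
  final String key;
  final int kvCount; // insertion order; also tells otherwise-identical keys apart

  MemKeySketch(String key, int kvCount) {
    this.key = key;
    this.kvCount = kvCount;
  }

  boolean visibleToScanStartedAt(int scanStartKvCount) {
    return kvCount <= scanStartKvCount;
  }
}
```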
I took this approach in 9a7a523
throw new IllegalStateException("Duplicate key run exceeded scan batch timeout for "
    + cutKey + ". Increase " + Property.TABLE_SCAN_BATCH_DUPLICATE_MAX_MULTIPLIER.getKey()
    + " or batch timeout, or reduce duplicates for this key.");
}
This if stmt used to have a break in it; seems like that will still be needed. When we are in here we have finished the batch and need to break out of the loop.
continueKey = new Key(key);
skipContinueKey = true;
break;
if (!cutPending) {
These changes add a lot of new work per key/value read. Normally that work will be wasted effort and never needed. Instead of doing that, we could do a check when closing off a batch to see if we are in the situation where the keys are the same. Maybe something like the following; not sure how this would change the rest of the code, but hopefully this strategy could avoid adding any new work for each key/value.
if (resultSize >= maxResultsSize || results.size() >= scanParams.getMaxEntries() || timesUp) {
  // closing off a batch, need to see if the last key in the batch and the next unread key are the same
  if (iter.hasTop()) {
    iter.next();
    if (iter.hasTop() && results.get(results.size() - 1).getKey().equals(iter.getTopKey())) {
      // the last key in the batch and the next key match, so can not terminate the batch yet...
      // not sure what this should do... it could start looping here or set a boolean that
      // modifies the outer loop behavior... could also read the config here to determine what to
      // do... in any case we do not need to do extra work until now.
      // The best thing to do for efficiency is probably to completely handle the case of duplicates here,
      // maybe call a function that reads the duplicate related config and loops. This would mean
      // the normal loop can spend zero computation on this edge case. However, not sure if that
      // could be done cleanly w/o introducing duplicate code. Depends on whether the main loop and this
      // new loop could share code.
    }
  }
}
Fixes #5951
When a scan contained multiple duplicate keys and the scan broke or cut before returning all of those duplicates, the remainder of those duplicate keys were dropped and not returned when the scan resumed. This happened because when the scan cut, we did not store the MemKey, which contains an additional field, kvCount, that when used distinguishes otherwise duplicate keys. When we resumed the scan without these MemKey details, the scan picked back up after the logical key we left off on, which incorrectly skipped the remaining duplicate keys. We now store the full MemKey, and when the scan resumes on the same key, it skips only the one entry we already returned, so the other duplicate keys are correctly returned now.

The new test case fails without the rest of the changes and passes with them.
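A simplified sketch of the resume behavior described above, using a made-up MemKey stand-in and an assumed ascending kvCount order rather than the actual tserver seek logic:

```java
import java.util.List;

public class ResumeSketch {
  // Stand-in for a MemKey: the logical key plus the kvCount that distinguishes
  // otherwise-identical keys in the in-memory map.
  record MemKey(String key, int kvCount) {}

  static int compare(MemKey a, MemKey b) {
    int c = a.key().compareTo(b.key());
    return c != 0 ? c : Integer.compare(a.kvCount(), b.kvCount());
  }

  // Resume strictly after the last entry that was already returned. Comparing on
  // (key, kvCount) skips only that one entry; comparing on the key alone would
  // skip the entire run of duplicates, which is the bug this PR fixes.
  static List<MemKey> resumeAfter(List<MemKey> sorted, MemKey lastReturned) {
    return sorted.stream().filter(mk -> compare(mk, lastReturned) > 0).toList();
  }

  public static void main(String[] args) {
    List<MemKey> entries = List.of(
        new MemKey("row1", 1), new MemKey("row1", 2), new MemKey("row1", 3));
    System.out.println(resumeAfter(entries, new MemKey("row1", 1)));
    // -> [MemKey[key=row1, kvCount=2], MemKey[key=row1, kvCount=3]]
  }
}
```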