-
Notifications
You must be signed in to change notification settings - Fork 600
HDDS-14859. Ignore transient errors while validating RocksDb on volumes. #9947
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
9f50fba
c2ffea9
7190e4e
a2160c2
7dc2bfd
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -25,6 +25,7 @@ | |
| import jakarta.annotation.Nullable; | ||
| import java.io.File; | ||
| import java.io.IOException; | ||
| import java.time.Duration; | ||
| import java.util.Iterator; | ||
| import java.util.List; | ||
| import java.util.concurrent.ConcurrentSkipListSet; | ||
|
|
@@ -306,20 +307,34 @@ public synchronized VolumeCheckResult check(@Nullable Boolean unused) | |
|
|
||
| @VisibleForTesting | ||
| public VolumeCheckResult checkDbHealth(File dbFile) throws InterruptedException { | ||
| if (!getDiskCheckEnabled()) { | ||
| if (!(getDiskCheckEnabled() && getDatanodeConfig().isRocksDbDiskCheckEnabled())) { | ||
| return VolumeCheckResult.HEALTHY; | ||
| } | ||
|
|
||
| try (ManagedOptions managedOptions = new ManagedOptions(); | ||
| ManagedRocksDB ignored = ManagedRocksDB.openReadOnly(managedOptions, dbFile.toString())) { | ||
| // Do nothing. Only check if rocksdb is accessible. | ||
| LOG.debug("Successfully opened the database at \"{}\" for HDDS volume {}.", dbFile, getStorageDir()); | ||
| } catch (Exception e) { | ||
| if (Thread.currentThread().isInterrupted()) { | ||
| throw new InterruptedException("Check of database for volume " + this + " interrupted."); | ||
| // We attempt to open RocksDb twice to ignore any transient errors | ||
| // and to confirm that we actually cannot open RocksDb in readonly mode. | ||
| final int maxAttempts = getDatanodeConfig().getDiskCheckRetryAttempts(); | ||
| final Duration maxRetryGap = getDatanodeConfig().getDiskCheckRetryGap(); | ||
| for (int attempt = 0; attempt < maxAttempts; attempt++) { | ||
| try (ManagedOptions managedOptions = new ManagedOptions(); | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this one can be created outside of for loop? do we need to call managedOption.close() for clean up? |
||
| ManagedRocksDB ignored = | ||
| ManagedRocksDB.openAsSecondary(managedOptions, dbFile.toString(), getTmpDir().getPath())) { | ||
| // Do nothing. Only check if rocksdb is accessible. | ||
| LOG.debug("Successfully opened the database at \"{}\" for HDDS volume {}.", dbFile, getStorageDir()); | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. info? |
||
| break; | ||
| } catch (Exception e) { | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. are there exceptions that indicates permanent problem with opening vs. transient? |
||
| if (Thread.currentThread().isInterrupted()) { | ||
| throw new InterruptedException("Check of database for volume " + this + " interrupted."); | ||
| } | ||
|
|
||
| if (attempt == maxAttempts - 1) { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Basically here it does retry for any failure, not just transient errors. Do we know which errora are transient that could be recovered by retry, and which errors cannot?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, the idea is that if you can't open the RocksDb twice, the problem should definitely be counted as an error. There is no clear list of the possible transient errors, as such this check is purely defensive. From the RocksDb docs and comments, it appears that we may not be able to open the database in a Read-Only mode when any of these happen:
|
||
| LOG.error("Could not open Volume DB located at {}", dbFile, e); | ||
| getIoTestSlidingWindow().add(); | ||
| } else { | ||
| LOG.warn("Could not open Volume DB located at {}", dbFile, e); | ||
| Thread.sleep(maxRetryGap.toMillis()); | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do we want to track thread interruption for this sleep? |
||
| } | ||
| } | ||
| LOG.warn("Could not open Volume DB located at {}", dbFile, e); | ||
| getIoTestSlidingWindow().add(); | ||
| } | ||
|
|
||
| if (getIoTestSlidingWindow().isExceeded()) { | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -75,6 +75,14 @@ public static ManagedRocksDB openReadOnly( | |
| ); | ||
| } | ||
|
|
||
| public static ManagedRocksDB openAsSecondary( | ||
| final ManagedOptions options, | ||
| final String dbPath, | ||
| final String secondaryDbLogFilePath) | ||
| throws RocksDBException { | ||
| return new ManagedRocksDB(RocksDB.openAsSecondary(options, dbPath, secondaryDbLogFilePath)); | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. when we open rocksdb as secondary vs. readonly do we open a use case for modifying data from secondary instance vs readonly instance?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Secondary and read-only instances are both read-only in behavior. Both won't allow writes. So we should be safe here. |
||
| } | ||
|
|
||
| public static ManagedRocksDB open( | ||
| final DBOptions options, final String path, | ||
| final List<ColumnFamilyDescriptor> columnFamilyDescriptors, | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it be at least 15m?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we don't want a long time between two checks as the goal here is to ignore transient errors while opening RocksDb.
If the gap between two consecutive attempts at opening RocksDb is too long, it is no longer about transient errors.
Secondly, we can't have the time period between opening the RocksDb twice be 15 minutes as it will fail the disk check timeout of 10 minutes.
In case of errors when opening the RocksDb, we ideally want the thread to sleep for a very short time, try to open the RocksDb a second time and if we still fail, mark it as an actual failure to open RocksDb.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also increase 10 mins timeout then? in case of 3rd retry we'll spend 2 out of 10 mins in sleep + time for actual processing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So 2 checks should be more than sufficient to definitively declare if we are failing to open RocksDb. After two checks we should let the sliding window handle if further checks should be required on that volume.
If we allow more than 2 checks then definitely the timeouts for each disk check will also have to become dynamic and the 10 minute threshold maybe too small. On the other hand we would like to know within 10 minutes if a disk is unhealthy or not as elongating this check pushes the future checks further out.