Skip to content

Conversation

@kevinrr888
Copy link
Member

@kevinrr888 kevinrr888 commented Sep 23, 2025

Creates HDFSContextClassLoaderFactory which will replace the use of the (buggy) VFSClassLoader with URLClassLoader while still supporting a ClassLoader per context.

Creates SimpleHDFSClassLoaderFactory which will replace the use of the (buggy) VFSClassLoader with URLClassLoader while still supporting a ClassLoader per context:

* SimpleHDFSClassLoaderFactory will be the replacement for the ReloadingVFSContextClassLoaderFactory and uses URLClassLoader instead of VFSClassLoader
* Created SimpleHDFSClassLoaderFactoryTest
* Bumped accumulo-classloaders from accumulo 2.1.3 to 2.1.4
* Minor javadoc fix for ReloadingVFSContextClassLoaderFactory

Initial commit for apache#27
@kevinrr888 kevinrr888 self-assigned this Sep 23, 2025
@kevinrr888 kevinrr888 marked this pull request as draft September 23, 2025 20:11
@kevinrr888 kevinrr888 linked an issue Sep 23, 2025 that may be closed by this pull request
Copy link
Contributor

@ddanielr ddanielr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to test the new classloader functionality.

I suggest pulling some changes out into a separate PR that we can merge in and just have this PR contain the new feature logic.

Comment on lines 130 to 138
@Override
public ClassLoader getClassLoader(String contextName) throws ContextClassLoaderException {
var expectedPath = userDefDir.resolve(contextName).resolve(MANIFEST_FILE_NAME);
var classLoader = classLoaders.getIfPresent(expectedPath);
if (classLoader == null) {
throw new ContextClassLoaderException(contextName);
}
return classLoader;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method should have a few steps:

  1. Construct an hdfs path to the manifest file from the context name (the context name could be the full path to the manifest file, or it could be the directory that contains the manifest; alternatively, it could be just a relative directory name that is appended to some base URL that is set as an environment variable)
  2. Use the path to download the manifest file and read its contents
  3. Use the manifest file contents to download the files to a uniquely named local path (set by the user as an environment variable or system property)
  4. Construct a URLClassLoader using the local directory where the files were downloaded to, add it to the cache, and return it from this method

Step 3 is the complicated one, because it should be able to detect if another process is downloading the same context, and wait a reasonable, but brief, time for it to finish, before proceeding to do it itself. If it does proceed, it should do so in such a way that it doesn't matter which one finishes, in case the other one is still working.

If the context files are already downloaded, then it should just do a verify, and construct the URLClassLoader (if one wasn't already cached).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation. I believe this is fully addressed in 7dce18c

* directories used now customizable via sys property
* now downloads files from HDFS to local filesystem
* renamed class and test
* remove initializing contexts in constructor: only init when needed by
  getClassLoader() call
Comment on lines 87 to 92
// Cache the class loaders for re-use
// WeakReferences are used so that the class loaders can be cleaned up when no longer needed
// Classes that are loaded contain a reference to the class loader used to load them
// so the class loader will be garbage collected when no more classes are loaded that reference it
private final Cache<Path,URLClassLoader> classLoaders =
CacheBuilder.newBuilder().weakValues().build();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May want in separate PR: support cleanup of local context directories when no longer needed, or may want to leave it up to user to perform this cleanup

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cache here is still needed for the instance of the classloader being reused for the same context... cleaning up the files on disk when no processes are using the context is a separate thing, but reusing an existing classloader for the same context is core behavior and should be here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I just put this comment here since it's somewhat related to cleanup

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added support for cleanup of local context directories: bbc67b6

Comment on lines 145 to 147
var localManifestFile = localContextDownloadDir.resolve(MANIFEST_FILE_NAME);
fs.copyToLocalFile(HDFSManifestFile, new Path(localManifestFile.toUri()));
LOG.info("Copied manifest file from HDFS {} to local {}", HDFSManifestFile, localManifestFile);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know that you need to write the manifest file locally. You could just read the contents of the file from HDFS and pass it to Gson.

Copy link
Member

@ctubbsii ctubbsii Sep 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking... one way to detect that the manifest has changed or not is to use a directory uniquely named based on the context name and the hash of the manifest (which itself contains hashes to the files in the manifest). So, if the contents of the context changes, you can detect it and treat it like a new context. To do that, though, you will probably have to check the remote manifest each time the context is requested, which could be expensive. Could cache the context -> Path.for(context-manifestHash) mapping for some short period of time to avoid looking for context changes too frequently.

In this way, you always have to read the contents of the manifest in memory (to hash it and create a unique local directory for it), so the value of adding the manifest to that directory would only be for troubleshooting/debugging, really.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my PR I approached this issue by:

  1. Fetching the Manifest file, computing a hash for it, and deserializing the json into a java object heirarchy that is kept in memory.
  2. The Manifest file contains the user specified interval for monitoring the manifest file location for manifest file changes.
  3. On that interval, refetch the Manifest file from the manifest file location, compute the hash to see if the file has changed. Respond accordingly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the manifest file is changed (say for contextA), do we expect that change to be picked up soon e.g., on some interval? Or do we expect it to be picked up the next time we call getClassLoader(contextA)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on an interval. Checking the manifest file every time getClassLoader(contextA) is called could be too many HDFS calls (if the manifest is stored in HDFS).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the interval present a problem? example:
interval = 10s
manifest file = contextA, jar1, jar2
time=0s: getClassLoader(contextA) = class loader with jar1, jar2
time=1s: change manifest file = contextA, jar3, jar4
time=5s: getClassLoader(contextA) = class loader with jar1, jar2 <-- we expected to get classloader with jar3, jar4
time=10s: now we update the classloader for contextA (too late?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the intention is to be eventually consistent. As long as it updates correctly and around the interval, then I think that is sufficient. Across the cluster the update can't be performed at exactly the same time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it thank you

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with this approach initially, but changed to checking the manifest on calls to getClassLoader instead of on an interval. The interval approach does not work when caching values/only storing WeakReferences to the class loader, as the value is not immediately returned to the user, thus no strong reference is made and may be GC'd at any time. Using the approach suggested by @jschmidt10 in #28 to first check the modification time, this should be fine/not too expensive to check in getClassLoader

I fully explain the problem and my fix in bbc67b6 commit message

- New thread which checks for changes to the manifest files (which have already been cached) on a configurable interval, updating the cache as necessary.
- For any context, multiple threads may try to download the JARs, create the URLClassLoader, and cache the class loader at the same time, but only one will succeed (whichever completes first). If another thread has already started this operation, first waits a bit for it to complete. If it doesn't complete quickly, try ourselves. This avoids too many threads trying at the same time, but still safely allows for it if needed.
- Now only download JARs from HDFS to local filesystem (no longer download manifest file - just read it from HDFS).
- JARs are now checked after download to ensure the checksum matches what is present in the manifest file for that jar.
- New tests including a concurrency stress-test.
Comment on lines 281 to 291
// verify downloaded jar checksum matches what is in the manifest file
var computedChecksumLocalJar = checksumLocalFile(localTempJarPath);
if (!computedChecksumLocalJar.equals(jar.getChecksum())) {
throw new ContextClassLoaderException(contextName,
new IllegalStateException(String.format(
"checksum: %s of downloaded jar: %s (downloaded from %s to %s) did not match "
+ "checksum present: %s in manifest file: %s. Consider retrying and/or "
+ "updating the checksum for this jar in the manifest file",
computedChecksumLocalJar, jar.getJarName(), HDFSJarPath, localTempJarPath,
jar.getChecksum(), HDFSManifestFile)));
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe want to retry download a few times in case the file got corrupted in download? (is this possible?)

public class HDFSContextClassLoaderFactory implements ContextClassLoaderFactory {
private static final Logger LOG = LoggerFactory.getLogger(HDFSContextClassLoaderFactory.class);
public static final String MANIFEST_FILE_NAME = "manifest.json";
private static final String HASH_ALG = "SHA-256";
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want the hash alg to be configurable?

Comment on lines 104 to 107
public static final String HDFS_CONTEXTS_BASE_DIR = "hdfs.contexts.class.loader.base.dir";
public static final String LOCAL_CONTEXTS_DOWNLOAD_DIR =
"local.contexts.class.loader.download.dir";
public static final String MANIFEST_FILE_CHECK_INTERVAL = "manifest.file.check.interval.sec";
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which is preferred: setting these as system props or setting them as (new) Accumulo properties. For now, have them as system props

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On larger systems, we've seen these periodic NameNode calls, like getting file info, put stress on it. These knobs might take a bit of tuning to figure out the sweet spot so I'd vote an Accumulo property so we don't have to bounce the tserver to try new values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with an Accumulo property. I can change this later if others agree as well, fine for now for simpler reviews/testing.

@kevinrr888 kevinrr888 changed the title WIP - New SimpleHDFSClassLoaderFactory New HDFSContextClassLoaderFactory Oct 6, 2025
- new dir `accumulo-classloaders/modules/classloaders`
- module moved from `accumulo-classloaders/modules/vfs-class-loader` to `accumulo-classloaders/modules/classloaders/vfs-class-loader`
- new module `accumulo-classloaders/modules/classloaders/hdfs-class-loader`
- new dir `accumulo-classloaders/modules/classloaders/jar-creation` with children `shell` (stores the scripts to make jars used in tests by both vfs-class-loader and hdfs-class-loader modules - scripts were originally under vfs-class-loader) and `templates` (stores the template class files used in creating the jars - originally under vfs-class-loader)
@kevinrr888 kevinrr888 marked this pull request as ready for review October 8, 2025 16:24
// if the final path already exists, we are replacing it with temp path, so recursively delete
// the final path
if (Files.exists(tempState.finalPath)) {
FileUtils.deleteDirectory(tempState.finalPath.toFile());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a bit dubious to me off the top of my head. I'm picturing a situation where we have operations in progress and we blow away the old jars for a brief period. I guess the takeaway is you should be using new contexts instead of patching the existing one to be safe?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it seems inherently risky to be changing e.g., contextA from using JAR1 to using JAR2 while operations are in progress on JAR1. A warning would probably be good in the documentation somewhere, but this seems more of a user consideration.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought about this some more, and I don't think it's a concern. Once the classes are loaded, the JARs can be deleted without issue. Wrote a quick test to confirm, and this works without issue:

  @Test
  public void testing() throws Exception {
    HDFSContextClassLoaderFactory factory = new HDFSContextClassLoaderFactory();
    factory.init(null);

    writeManifestAndJarToHDFS(factory, HELLO_WORLD_JAR, HELLO_WORLD_JAR_RESOURCE_NAME, true);

    var classLoader = factory.getClassLoader(CONTEXT1);
    var clazz = classLoader.loadClass(HELLO_WORLD_CLASS);

    FileUtils.deleteDirectory(
        new File(System.getProperty(HDFSContextClassLoaderFactory.LOCAL_CONTEXTS_DOWNLOAD_DIR)));

    var helloWorldObj = clazz.getDeclaredConstructor().newInstance();
    var validate = clazz.getMethod("validate");
    validate.invoke(helloWorldObj);
  }

}

var HDFSManifestFileBytes = HDFSManifestFileIn.readAllBytes();
try (var tempState = createTempState(contextName, HDFSManifestFile,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that createTempState() is always followed by createClassLoader(), can we combine those two methods into one? Just pass the args of createTempState() into createClassLoader()? That'll eliminate a bit of duplication between this building of the classloader and the code above where the FileChecker does it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I split these into two to allow multiple threads to attempt the creation at the same time for calls to getClassLoader() with the same context. Note that we only perform synchronization on the cache for createClassLoader(). createTempState() is called outside of this, which does most of the work:

      try (var tempState = createTempState(...)) {
        // atomically return stored value if it exists OR rename the temp dir to the final dir and
        // create and store the new class loader
        return classLoaders.computeIfAbsent(hdfsManifestFile, key -> {
          try {
            ...createClassLoader(tempState);...
          } catch (IOException e) {
            throw new IllegalStateException(e);
          }
        }).getClassLoader();
      }

- Avoid readAllBytes() for JAR files
- First check manifest file metadata for last modification time before checking the checksum (avoids reading entire file unnecessarily)
- Delay opening files until needed
- Avoid logging retry completion if a retry was never attempted
- Removed incorrect use of weakValues() for the Cache... Was originally Cache<Path,URLClassLoader> which made weakValues() appropriate at the time (since we return the value directly to the user, creating a strong reference on their end), but the value has since changed to a custom object with the class loader among other fields, so is no longer applicable (we still only return the class loader to the user, but don't create a strong reference to the whole custom object). Will address fix to support similar cleanup in another commit.
- Fixed logs not working: added appropriate pom dependency, added log4j2-test.properties file to hdfs-class-loader module
- Reduced visibility of many methods/classes
- Optimization to clone MessageDigest
- Improved some variable names
- Fixed a bug where the Cache was using weakValues() incorrectly. Cache value was originally URLClassLoader (value returned directly to user, creating strong reference). Cache value had since changed to be a custom object (FinalState), but was still using weakValues(). Value returned to user was still only the URLClassLoader part of FinalState, so a strong reference was created for the URLClassLoader, but not the actual value in the Cache, leading to Cache values being unexpectedly GC'd. Fixed this by:
	- replacing Cache with ConcurrentMap (since the Cache would have just been `CacheBuilder.newBuilder().build()`, functioning the same as a regular ConcurrentMap)
	- using WeakReference for the class loader in FinalState
	- ensuring a strong reference to the URLClassLoader exists until it is returned to the user
	- adding a new cleanup thread to cleanup map entries whose URLClassLoader has been GC'd
	- removed manifest file checker thread (would check if the manifest file for entries in the store had changed since being stored, updating the store and local context directory if needed). This periodic checker will not work when using WeakReference for the class loader in FinalState, as a strong reference is not immediately made. Instead of the checker thread, check on every call to getClassLoader(). This should be fine as now we first read the metadata to see if the manifest file has been touched since being stored. Only then do we read the whole file.
- New cleanup thread additionally deletes local context directories when their associated map entry is removed.
- Test changes and new test testStoreCleanup()
Did not need the context name being defined by both the context path and
also a key in the manifest json file. e.g., previously could have
"hdfs://test:1234/contexts/contextA" and the manifest file could have
"context":"contextA". Only need one of these.
Comment on lines +269 to +270
var storeCleaner = new Thread(new StoreCleaner(TimeUnit.SECONDS.toMillis(30)));
storeCleaner.start();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can make this configurable or change this time. Just set to 30 seconds for now

@ctubbsii
Copy link
Member

I think @dlmarion 's PR #30 is closer to what we want, so I think we should close this, and work on polishing up that one.

@ctubbsii ctubbsii closed this Nov 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Write an classloader that replaces VFS functionality.

6 participants