Conversation

@dlmarion
Contributor

Added HdfsUrlStreamHandlerFactory and set it as the JVM's URLStreamHandlerFactory in a static initializer in LocalCachingContextClassLoaderFactory to handle the hdfs protocol.

Closes #47

@dlmarion requested a review from ctubbsii on January 16, 2026 at 18:36
@dlmarion self-assigned this on Jan 16, 2026
@dlmarion
Contributor Author

This new class includes copied and modified content from FsUrlConnection, FsUrlStreamHandler, and FsUrlStreamHandlerFactory in the org.apache.hadoop.fs package. Do we need to account for this in a NOTICE file or something?

@ctubbsii
Member

No. They are all Apache-licensed, produced at the Apache Software Foundation, so any updates to our NOTICE would just add the same info that is already there. You can add an inline comment, if you think it's needed, in case we need to grab any changes from upstream, but these are trivial things, so I don't think even that is really necessary.

@dlmarion
Contributor Author

Tested locally with an Accumulo instance and the process was able to download a jar file from HDFS.

@keith-turner
Contributor

If we used URI instead of URL in the Resource class, would that fix things w/o having to register this handler statically?

Am I right that we are registering a handler that supports creating an input stream, but not actually using that functionality? If so, then we are creating and registering code that is never used, just so the URL string will parse. That is what made me wonder if using URI instead would work.

@keith-turner
Contributor

Ran an experiment to see if URI was ok w/ hdfs.

jshell> new URL("hdfs://localhost/foo")
|  Exception java.net.MalformedURLException: unknown protocol: hdfs
|        at URL.<init> (URL.java:681)
|        at URL.<init> (URL.java:569)
|        at URL.<init> (URL.java:516)
|        at (#1:1)

jshell> new URI("hdfs://localhost/foo")
$2 ==> hdfs://localhost/foo

@keith-turner
Contributor

Created #49.

@ctubbsii
Member

I'm going to try to do the Provider registration I suggested below instead of setting the factory, and if it works, I'll push an update to this PR.

Comment on lines 63 to 67
// URI#getPath returns null value if path contains relative path
// i.e file:root/dir1/file1
// So path can not be constructed from URI.
// We can only use schema specific part in URI.
// Uri#isOpaque return true if path is relative.
@ctubbsii
Member

This comment seems to be addressing the fact that a URI could contain a relative path. However, that is not possible if it came from a file: URL, because URLs, by definition, are absolute. While it is possible to have a URI that represents a relative file: path, it is not possible for a URL to do that, and this URI came from a URL, so the relative-path situation cannot arise here.
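
For illustration, a quick jshell check of both cases (a hypothetical session): a URI derived from a file: URL keeps its absolute path, while a hand-built relative file: URI is opaque and has no path at all.

jshell> new URL("file:/root/dir1/file1").toURI().getPath()
$1 ==> "/root/dir1/file1"

jshell> new URI("file:root/dir1/file1").isOpaque()
$2 ==> true

jshell> new URI("file:root/dir1/file1").getPath()
$3 ==> null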

// method. We are not using org.apache.hadoop.fs.FsUrlStreamHandlerFactory
// because it overrides the handling of the "file" protocol.
static {
  URL.setURLStreamHandlerFactory(new HdfsUrlStreamHandlerFactory(new Configuration()));
@ctubbsii
Member

Apparently, in Java 9+, there's a better way to do this that avoids the problem of the factory being settable only once per JVM. Instead of setting the factory, you can register a URLStreamHandlerProvider to handle hdfs: URLs. We can use @AutoService to generate the appropriate META-INF/services file.
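
A minimal sketch of that approach, assuming the handler class from this PR (the provider and handler names here are illustrative, not necessarily the final code):

// Registers a handler for hdfs: URLs via the Java 9+ ServiceLoader mechanism.
import java.net.URLStreamHandler;
import java.net.spi.URLStreamHandlerProvider;

import com.google.auto.service.AutoService;

// @AutoService generates META-INF/services/java.net.spi.URLStreamHandlerProvider,
// which the JDK discovers on first use of the URL class.
@AutoService(URLStreamHandlerProvider.class)
public class HdfsURLStreamHandlerProvider extends URLStreamHandlerProvider {
  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    // Claim only the hdfs protocol; returning null defers to the JDK's
    // built-in handlers for file, http, jar, etc.
    return "hdfs".equals(protocol) ? new HdfsURLStreamHandler() : null;
  }
}

Since the JDK caches the resolved handler per protocol, the provider is consulted once for hdfs and never interferes with other schemes.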

@keith-turner
Contributor

In #49 I removed all calls to URL.setURLStreamHandlerFactory in the test and the tests are passing, so that indicates the change will probably also work in Accumulo w/ HDFS.

@ctubbsii
Member

> In #49 I removed all calls to URL.setURLStreamHandlerFactory in the test and the tests are passing, so that indicates the change will probably also work in Accumulo w/ HDFS.

I commented on that. We should not do that. I explained why on that PR.

@keith-turner
Contributor

> In #49 I removed all calls to URL.setURLStreamHandlerFactory in the test and the tests are passing, so that indicates the change will probably also work in Accumulo w/ HDFS.

> I commented on that. We should not do that. I explained why on that PR.

I think #49 is cleaner because it avoids registering JVM-wide resources (which could cause code outside the classloader plugin to change behavior). I really like the locality: in the current code, FileResolver is where the download happens, and if it's happy w/ the URL/URI then things are fine. Looking at the Java code that converts URIs to URLs, the main thing it checks is whether the URI has a scheme, which #49 adds a check for in FileResolver.
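
A quick jshell check of that conversion (output abbreviated): URI#toURL rejects a URI that has no scheme.

jshell> new URI("foo/bar").toURL()
|  Exception java.lang.IllegalArgumentException: URI is not absolute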

* Register a provider for the hdfs: URL stream handling, rather than
  override the system factory
* Also fix SpotBugs and rename classes to match the naming convention
  from the classes they override (e.g. "URLStreamHandler" instead of
  "UrlStreamHandler")
@ctubbsii
Member

> I think #49 is cleaner because it avoids registering JVM-wide resources (which could cause code outside the classloader plugin to change behavior).

It introduces the possibility of far more problems than it solves.

Take a look at the changes I just pushed. This fixes it much more narrowly. All it does is say "use Hadoop to open hdfs: URLs". Any other application's implementation would ultimately do the same thing, but if somebody really wanted something else to handle hdfs: URLs in their code, they can easily do so by manipulating the Java ServiceLoader files on their classpath (META-INF/services). In my change, we don't forcibly register it in code at all anymore. We just put a provider on the classpath, which the user can choose to omit/exclude without changing code (it may require a rebuild of the jar, or using a plugin to transform the jar contents to remove the service file, but that's not very difficult).

> I really like the locality: in the current code, FileResolver is where the download happens, and if it's happy w/ the URL/URI then things are fine. Looking at the Java code that converts URIs to URLs, the main thing it checks is whether the URI has a scheme, which #49 adds a check for in FileResolver.

My concern isn't with hdfs: URLs being represented as URIs. My concern is with us changing our entire code to use URIs, and then failing to account for all the edge cases where a file: URI (which can be relative) doesn't work the same as a file: URL (which must be absolute). There's just so much weirdness that can get pushed through. Using URI is hardly better than using String.
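
A one-line jshell illustration of that last point: URI happily accepts a bare string with no scheme at all.

jshell> new URI("not-a-scheme-at-all")
$1 ==> not-a-scheme-at-all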

@ctubbsii force-pushed the set-url-stream-handler branch from ea54c13 to 732749a on January 16, 2026 at 22:18
@dlmarion
Contributor Author

Testing locally works with the latest commit.

@keith-turner
Contributor

Moving the change from calling the static method to a service loader is an improvement. This is still making a global change to the VM, which can make for fun debugging (I have, multiple times, debugged code X that was broken by some odd global change that unrelated code Y made). I still think the URI change is cleaner because more locality makes things more predictable and easier to reason about. These global VM changes are not being made because the intent is to make a global change for all code in Accumulo; they are being made just to get this specific code to work. I am not opposed to these changes, just sharing my opinion. It seems both approaches have some problems.

Development

Successfully merging this pull request may close these issues.

"Unknown protocol: hdfs" error when testing with Accumulo
