Conversation

@dlmarion
Contributor

Added HdfsUrlStreamHandlerFactory and set it as the JVM's URLStreamHandlerFactory in a static initializer in LocalCachingContextClassLoaderFactory to handle the hdfs protocol.

Closes #47

@dlmarion requested a review from ctubbsii on January 16, 2026 at 18:36
@dlmarion self-assigned this on Jan 16, 2026
@dlmarion
Contributor Author

This new class includes copied and modified content from FsUrlConnection, FsUrlStreamHandler, and FsUrlStreamHandlerFactory in the org.apache.hadoop.fs package. Do we need to account for this in a NOTICE file or something?

@ctubbsii
Member

No. They are all Apache-licensed, produced at the Apache Software Foundation, so any updates to our NOTICE would just add the same info that is already there. You can add an inline comment, if you think it's needed, in case we need to grab any changes from upstream, but these are trivial things, so I don't think even that is really necessary.

@dlmarion
Contributor Author

Tested locally with an Accumulo instance and the process was able to download a jar file from HDFS.

@keith-turner
Contributor

If we used URI instead of URL in the Resource class, would that fix things w/o having to register this handler statically?

Am I right that we are registering a handler that supports creating an input stream, but not actually using that functionality? If so, then we are creating and registering code that is never used, just so the URL string will parse. That is what made me wonder if using URI instead would work.

@keith-turner
Contributor

Ran an experiment to see if URI was ok w/ hdfs.

jshell> new URL("hdfs://localhost/foo")
|  Exception java.net.MalformedURLException: unknown protocol: hdfs
|        at URL.<init> (URL.java:681)
|        at URL.<init> (URL.java:569)
|        at URL.<init> (URL.java:516)
|        at (#1:1)

jshell> new URI("hdfs://localhost/foo")
$2 ==> hdfs://localhost/foo

@keith-turner
Contributor

Created #49.

@ctubbsii
Member

I'm going to try to do the Provider registration I suggested below instead of setting the factory, and if it works, I'll push an update to this PR.

Comment on lines 63 to 67
// URI#getPath returns null value if path contains relative path
// i.e file:root/dir1/file1
// So path can not be constructed from URI.
// We can only use schema specific part in URI.
// Uri#isOpaque return true if path is relative.
@ctubbsii
Member

This comment seems to be addressing the fact that a URI could contain a relative path. However, that is not possible if it came from a file: URL, because URLs, by definition, are absolute. While it is possible to have a URI that represents a relative file: path, it is not possible for a URL to do that, and this URI came from a URL, so the relative-path situation cannot arise here.
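
For illustration, a quick jshell check of both cases (a hypothetical session): a URI derived from a file: URL keeps its absolute path, while a hand-built relative file: URI is opaque and has no path at all.

jshell> new URL("file:/root/dir1/file1").toURI().getPath()
$1 ==> "/root/dir1/file1"

jshell> new URI("file:root/dir1/file1").isOpaque()
$2 ==> true

jshell> new URI("file:root/dir1/file1").getPath()
$3 ==> null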

// method. We are not using org.apache.hadoop.fs.FsUrlStreamHandlerFactory
// because it overrides the handling of the "file" protocol.
static {
  URL.setURLStreamHandlerFactory(new HdfsUrlStreamHandlerFactory(new Configuration()));
@ctubbsii
Member

Apparently, in Java 9+, there's a better way to do this that avoids the problem of the factory being settable only once per JVM. Instead of setting the factory, you can register a URLStreamHandlerProvider to handle hdfs: URLs. We can use @AutoService to generate the appropriate META-INF/services file.
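
A minimal sketch of that approach, assuming the handler class from this PR (the provider and handler names here are illustrative, not necessarily the final code):

// Registers a handler for hdfs: URLs via the Java 9+ ServiceLoader mechanism.
import java.net.URLStreamHandler;
import java.net.spi.URLStreamHandlerProvider;

import com.google.auto.service.AutoService;

// @AutoService generates META-INF/services/java.net.spi.URLStreamHandlerProvider,
// which the JDK discovers on first use of the URL class.
@AutoService(URLStreamHandlerProvider.class)
public class HdfsURLStreamHandlerProvider extends URLStreamHandlerProvider {
  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    // Claim only the hdfs protocol; returning null defers to the JDK's
    // built-in handlers for file, http, jar, etc.
    return "hdfs".equals(protocol) ? new HdfsURLStreamHandler() : null;
  }
}

Since the JDK caches the resolved handler per protocol, the provider is consulted once for hdfs and never interferes with other schemes.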

@keith-turner
Contributor

In #49 I removed all calls to URL.setURLStreamHandlerFactory in the test and the tests are passing, so that indicates the change will probably also work in Accumulo w/ HDFS.

@ctubbsii
Member

> In #49 I removed all calls to URL.setURLStreamHandlerFactory in the test and the tests are passing, so that indicates the change will probably also work in Accumulo w/ HDFS.

I commented on that. We should not do that. I explained why on that PR.

@keith-turner
Contributor

> In #49 I removed all calls to URL.setURLStreamHandlerFactory in the test and the tests are passing, so that indicates the change will probably also work in Accumulo w/ HDFS.

> I commented on that. We should not do that. I explained why on that PR.

I think #49 is cleaner because it avoids registering JVM-wide resources (which could cause code outside the classloader plugin to change behavior). I really like the locality: in the current code, FileResolver is where the download happens, and if it's happy w/ the URL/URI then things are fine. Looking at the Java code that converts URIs to URLs, the main thing it checks is whether the URI has a scheme, which #49 adds a check for in FileResolver.
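
A quick jshell check of that conversion (output abbreviated): URI#toURL rejects a URI that has no scheme.

jshell> new URI("foo/bar").toURL()
|  Exception java.lang.IllegalArgumentException: URI is not absolute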

* Register a provider for the hdfs: URL stream handling, rather than
  override the system factory
* Also fix SpotBugs and rename classes to match the naming convention
  from the classes they override (e.g. "URLStreamHandler" instead of
  "UrlStreamHandler")
@ctubbsii
Member

> I think #49 is cleaner because it avoids registering JVM-wide resources (which could cause code outside the classloader plugin to change behavior).

It introduces the possibility of far more problems than it solves.

Take a look at the changes I just pushed. This fixes it much more narrowly. All it does is say "use Hadoop to open hdfs: URLs". Any other application's implementation would ultimately do the same thing, but if somebody really wanted something else to handle hdfs: URLs in their code, they can easily do so by manipulating the Java ServiceLoader files on their classpath (META-INF/services). In my change, we don't forcibly register it in code at all anymore. We just put a provider on the classpath, which the user can choose to omit/exclude without changing code (it may require a rebuild of the jar, or using a plugin to transform the jar contents to remove the service file, but that's not very difficult).

> I really like the locality: in the current code, FileResolver is where the download happens, and if it's happy w/ the URL/URI then things are fine. Looking at the Java code that converts URIs to URLs, the main thing it checks is whether the URI has a scheme, which #49 adds a check for in FileResolver.

My concern isn't with hdfs: URLs being represented as URIs. My concern is with us changing our entire code to use URIs, and then failing to account for all the edge cases where a file: URI (which can be relative) doesn't work the same as a file: URL (which must be absolute). There's just so much weirdness that can get pushed through. Using URI is hardly better than using String.
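
A one-line jshell illustration of that last point: URI happily accepts a bare string with no scheme at all.

jshell> new URI("not-a-scheme-at-all")
$1 ==> not-a-scheme-at-all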

@ctubbsii force-pushed the set-url-stream-handler branch from ea54c13 to 732749a on January 16, 2026 at 22:18
@dlmarion
Contributor Author

Testing locally works with the latest commit.

@keith-turner
Contributor

Moving the change from calling the static method to a service loader is an improvement. This is still making a global change to the VM, which can make for fun debugging (I have, multiple times, debugged code X that was broken by some odd global change that unrelated code Y made). I still think the URI change is cleaner because more locality makes things more predictable and easier to reason about. These global VM changes are not being made because the intent is to make a global change for all code in Accumulo; they are being made just to get this specific code to work. I am not opposed to these changes, just sharing my opinion. It seems both approaches have some problems.

Development

Successfully merging this pull request may close these issues.

"Unknown protocol: hdfs" error when testing with Accumulo
