Added HdfsUrlStreamHandlerFactory, set in LocalCachingContextClassLoaderFactory #48
Conversation
Added HdfsUrlStreamHandlerFactory and set it as the URL StreamHandlerFactory in a static initializer in LocalCachingContextClassLoaderFactory to handle the hdfs protocol. Closes apache#47
This new class includes copied and modified content from FsUrlConnection, FsUrlStreamHandler, and FsUrlStreamHandlerFactory in the org.apache.hadoop.fs package. Do we need to account for this in a NOTICE file or something?
No. They are all Apache Licensed, produced at the Apache Software Foundation, so any updates to our NOTICE would just add the same info that is already there. You can add an inline comment, if you think it's needed, in case we need to grab any changes from upstream, but these are trivial things, so I don't think even that is really necessary.
Tested locally with an Accumulo instance and the process was able to download a jar file from HDFS.
If we used URI instead of URL in the Resource class, would that fix things w/o having to register this handler statically? I'm assuming we are registering a handler that supports creating an input stream, but we are not using that functionality. Is that correct? If so, then we are creating and registering code that is not being used just so the string will work. That is what made me wonder if using URI instead would work.
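The distinction behind this question can be demonstrated with a JDK-only snippet (the namenode host and path below are made up): `URI` parsing is purely syntactic, while constructing a `URL` consults the registered stream handlers for the scheme.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class UriVsUrlDemo {
    public static void main(String[] args) {
        // URI is purely syntactic; no stream handler is needed for any scheme.
        URI uri = URI.create("hdfs://namenode:8020/jars/plugin.jar");
        System.out.println(uri.getScheme()); // prints "hdfs"

        // URL construction consults the registered stream handlers, so on a
        // stock JDK (no hdfs handler registered) this throws.
        try {
            URL url = new URL("hdfs://namenode:8020/jars/plugin.jar");
            System.out.println("an hdfs handler is registered: " + url);
        } catch (MalformedURLException e) {
            System.out.println("no hdfs handler registered");
        }
    }
}
```

So a `URI`-based Resource class would parse `hdfs:` strings without any JVM-wide registration; the handler is only needed once you actually open a stream through the `URL` machinery.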
Ran an experiment to see if URI was ok w/ hdfs.
Created #49
ctubbsii left a comment:
I'm going to try to do the Provider registration I suggested below instead of setting the factory, and if it works, I'll push an update to this PR.
```java
// URI#getPath returns null value if path contains relative path
// i.e file:root/dir1/file1
// So path can not be constructed from URI.
// We can only use schema specific part in URI.
// Uri#isOpaque return true if path is relative.
```
This comment seems to be addressing the fact that a URI could contain a relative path. However, that is not possible if it came from a file: URL, because URLs, by definition, are absolute. While it is possible to have a URI that represents a relative file: path, it is not possible for a URL to do that, and this URI came from a URL, so the relative-path situation cannot occur here.
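The behavior the quoted code comment describes is straightforward to verify with plain JDK calls (the paths are illustrative): a scheme-specific part that does not start with `/` makes the URI opaque, so `getPath()` returns null.

```java
import java.net.URI;

class OpaqueUriDemo {
    public static void main(String[] args) {
        // A scheme followed by a non-"/" scheme-specific part is opaque:
        URI relative = URI.create("file:root/dir1/file1");
        System.out.println(relative.isOpaque());              // true
        System.out.println(relative.getPath());               // null
        System.out.println(relative.getSchemeSpecificPart()); // root/dir1/file1

        // An absolute path makes the URI hierarchical, so getPath() works:
        URI absolute = URI.create("file:/root/dir1/file1");
        System.out.println(absolute.isOpaque()); // false
        System.out.println(absolute.getPath());  // /root/dir1/file1
    }
}
```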
```java
// method. We are not using org.apache.hadoop.fs.FsUrlStreamHandlerFactory
// because it overrides the handling of the "file" protocol.
static {
  URL.setURLStreamHandlerFactory(new HdfsUrlStreamHandlerFactory(new Configuration()));
}
```
Apparently, in Java 9+, there's a better way to do this, one that avoids the problem of URL.setURLStreamHandlerFactory being callable only once per JVM. Instead of setting the factory, you can register a URLStreamHandlerProvider to handle the hdfs URLs. We can use @AutoService to generate the appropriate META-INF/services file.
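A minimal sketch of that approach, using hypothetical class names (the real PR code may differ). The `@AutoService` annotation is shown commented out and `openConnection` is stubbed where Hadoop's FileSystem API would be used, to keep the sketch dependency-free:

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.spi.URLStreamHandlerProvider;

// @AutoService(URLStreamHandlerProvider.class) // would generate the
// META-INF/services/java.net.spi.URLStreamHandlerProvider entry at compile time
class HdfsURLStreamHandlerProvider extends URLStreamHandlerProvider {

    @Override
    public URLStreamHandler createURLStreamHandler(String protocol) {
        if (!"hdfs".equals(protocol)) {
            return null; // defer to the JDK's built-in handlers for other schemes
        }
        return new URLStreamHandler() {
            @Override
            protected URLConnection openConnection(URL url) throws IOException {
                // A real implementation would open the path via Hadoop's
                // FileSystem API; stubbed here to stay dependency-free.
                throw new IOException("stub: delegate to Hadoop FileSystem for " + url);
            }
        };
    }
}
```

In a real provider the class and its no-arg constructor must be public so that ServiceLoader can instantiate it; returning null for unrecognized protocols lets the JDK fall back to its built-in handlers.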
In #49 I removed all calls to
I commented on that. We should not do that. I explained why on that PR.
I think #49 is cleaner because it avoids registering JVM-wide resources (which could cause code outside the classloader plugin to change behavior). I really like the locality. In the current code, FileResolver is where the download happens; if it's happy w/ the URL/URI then things are fine. Looking at the Java code that converts URIs to URLs, the main thing it checks is whether the URI has a scheme, which #49 adds a check for in FileResolver.
* Register a provider for the hdfs: URL stream handling, rather than override the system factory
* Also fix SpotBugs and rename classes to match the naming convention from the classes they override (e.g. "URLStreamHandler" instead of "UrlStreamHandler")
It introduces the possibility of far more problems than it solves. Take a look at the changes I just pushed. This fixes it much more narrowly. All it does is say "Use Hadoop to open hdfs: URLs". Any other application's implementation would ultimately do the same thing, but if somebody really wanted something else to handle hdfs: URLs in their code, they can easily do so by manipulating the Java ServiceLoader files on their classpath (META-INF/services). In my change, we don't forcibly register it in code at all anymore. We just put a provider on the classpath, which the user can choose to omit/exclude without changing code (it may require a rebuild of the jar, or using a plugin to transform the jar contents to remove the service file, but that's not very difficult).
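For reference, the ServiceLoader registration described here is just a text file on the classpath; excluding it disables the handler without any code change. The package and class name below are assumptions for illustration (with @AutoService the file is generated from the annotated class):

```
# File: META-INF/services/java.net.spi.URLStreamHandlerProvider
org.apache.accumulo.classloader.HdfsURLStreamHandlerProvider
```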
My concern isn't with
Force-pushed from ea54c13 to 732749a.
Testing locally works with the latest commit.
Moving the change from calling the static method to a service loader is an improvement. This is still making a global change to the VM, which can make for fun debugging (I have, multiple times, debugged code X that was broken by some odd global change that unrelated code Y made). I still think the URI change is cleaner because more locality makes things more predictable and easier to reason about. These global VM changes are not made because the intent is to make a global change for all code in Accumulo; they are being made just to get this specific code to work. I am not opposed to these changes, just sharing my opinion. It seems both changes have some problems.