Skip to content

Conversation

@keith-turner
Copy link
Contributor

@keith-turner keith-turner commented Dec 23, 2025

This is still a work in progress, but its close to being complete. Modified the Accumulo Access code to only use String and unicode. Made the following changes.

  • Removed all methods that take byte[] in the API
  • Removed the many static entry points in the API and replaced them w/ a single static entry point.
  • Reworked the internal code to operate on char instead of byte.
  • Limited authorizations to valid unicode characters that are not ISO control characters by default. This is configurable though.

Realized it would be really nice if users could further limit authorizations to a smaller subset of characters while working on this. The existing Accumulo Access APIs had a ton of static entry points and there was no one place that configurable behavior could be introduced . That led to reworking the API to have a single entry point where configuration could be applied instead of the many entry static points. Currently this entry point only allows setting a configurable authorization validator, but it makes it easy to support future configuration. This new single entry point is conceptually similar to how Gson works, it was inspired by that.

Still puzzling about how Accumulo might use this code or if that is possible. Want to experiment w/ that some.

Have not yet run the benchmarks w/ these code changes.

fixes #88

@dlmarion
Copy link
Contributor

dlmarion commented Jan 2, 2026

Do we still need the quoting and unquoting of terms in an AccessExpression if the goal is String and Unicode only?

@keith-turner
Copy link
Contributor Author

Do we still need the quoting and unquoting of terms in an AccessExpression if the goal is String and Unicode only?

Yeah still need them, the quoting allows using characters that are reserved for the language itself in auths. Also allows the language to use new characters in the future that are also used in quoted expressions.

@keith-turner
Copy link
Contributor Author

Ran the benchmark against commit dd2ef2d from this PR and saw the following.

AccessExpressionBenchmark.measureBytesValidation           thrpt   12  14.172 ± 0.031  ops/us
AccessExpressionBenchmark.measureLegacyEvaluationOnly      thrpt   12  22.322 ± 0.150  ops/us
AccessExpressionBenchmark.measureLegacyParseAndEvaluation  thrpt   12   9.637 ± 0.384  ops/us
AccessExpressionBenchmark.measureParseAndEvaluation        thrpt   12  10.821 ± 0.112  ops/us
AccessExpressionBenchmark.measureStringValidation          thrpt   12  17.741 ± 0.429  ops/us

Then ran against c70d418 from main and saw the following.

AccessExpressionBenchmark.measureBytesValidation           thrpt   12  35.619 ± 2.259  ops/us
AccessExpressionBenchmark.measureLegacyEvaluationOnly      thrpt   12  22.444 ± 0.409  ops/us
AccessExpressionBenchmark.measureLegacyParseAndEvaluation  thrpt   12   9.495 ± 0.021  ops/us
AccessExpressionBenchmark.measureParseAndEvaluation        thrpt   12  16.947 ± 0.328  ops/us
AccessExpressionBenchmark.measureStringValidation          thrpt   12  28.937 ± 0.013  ops/us

Improved the benchmark times with the changes in 22af52d . These changes use a char[] array instead of String internally to avoid calling with charAt() on string so frequently.

AccessExpressionBenchmark.measureBytesValidation           thrpt   12  18.068 ± 0.386  ops/us
AccessExpressionBenchmark.measureLegacyEvaluationOnly      thrpt   12  22.262 ± 0.165  ops/us
AccessExpressionBenchmark.measureLegacyParseAndEvaluation  thrpt   12   9.413 ± 0.164  ops/us
AccessExpressionBenchmark.measureParseAndEvaluation        thrpt   12  12.825 ± 0.253  ops/us
AccessExpressionBenchmark.measureStringValidation          thrpt   12  22.166 ± 1.012  ops/us

Experimented w/ adding char[] methods to the public API (in addition to string methods) and those are faster when you already have a char array. However using the java CharDecoder class to create char[] is much slower than string, so its not a net win.

Comment on lines 55 to 60
QUOTED,
/**
* Denotes that an authorization seen in a valid access expression was unquoted. This means the
* expression only contains the characters allowed in an unquoted authorization.
*/
UNQUOTED
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case where the predicate is used to evaluate a newly constructed Authorizations, then each individual authorization string in that will never be quoted. So, the concept of quoted/unquoted doesn't make sense for that case. Further, if you assume unquoted, and always pass that enum for that case, then it prevents a user from making a more restricted Predicate that ensures all authorizations are quoted or that all of them are unquoted. A user may want such a restricted predicate, to normalize their AccessExpressions, because the quotes still affect the lexical ordering of the expressions for keys stored in Accumulo, or the efficiency of the evaluator cache (because there could be different cache entries for the equivalent access expressions that differ only by optional quotes).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case where the predicate is used to evaluate a newly constructed Authorizations

Improved the naming in f8492ab for this case.

* @see #builder()
* @since 1.0
*/
public interface AccumuloAccess {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need "Accumulo" prefix on this object. While Accumulo is the containing project, I think this class could just be called "Access".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made that change in 778c9e4. I wanted a shorted name for this class but was not sure what name to use. Access seems good to me.

Comment on lines 92 to 95
/**
* @return a pre-allocated empty Authorizations object
*/
Authorizations newAuthorizations();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could probably remove this overloaded method. A user can just pass an empty set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in 8331f63

* valid
* @throws NullPointerException when any argument is null
*/
void findAuthorizations(String expression, Consumer<String> authorizationConsumer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it would be useful to have on AccessExpression itself as a non-static method. I'm not sure if it would still be useful here or not, for the case where you don't want to construct the AccessExpression and just want to search in a String. But, I'd lean towards not being useful here. It's probably more useful only on AccessExpression.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to move this out of the root level interface, however if that is done then expression would be parsed twice to do the following.

Access access = ...;
access.newExpression("a|b|c").findAuthorizations(System.out::println);

It will parse once to validate when creating the AccessExpression and then again when calling findAuthorizations().

*
* @throws NullPointerException when the argument is null
*/
String quote(String authorization) throws InvalidAuthorizationException;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a little weird to me to call this "quote" when sometimes it doesn't do any quoting, but I'm not sure of a better name, and probably wouldn't want to overload this with separate "alwaysQuote" and "quoteIfNeeded" methods. At least the Javadoc makes it clear.

* @throws InvalidAuthorizationException if the expression contains an invalid authorization
* @throws NullPointerException when the argument is null
*/
void validate(String expression)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I like this here. I get the utility of having it to avoid the object allocation if you just want to quickly verify a bunch of expressions, but it feels out of place. This isn't really here to construct or use an object from this library... it's just kind of an extra utility method that allows users to evaluate their own data. But, I think what users really are asking if they were to use this method is "if I were to construct a new AccessExpression from this String, would it succeed?", and I think the only 100% reliable way to do that is to just create a new AccessExpression from this String, which kind of makes this method redundant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the same as calling newExpression, except it does not allocate an object. So its an optimization. Could remove it, it can be added later. Also could rename it to validateExpression() which lines up w/ the newExpression() method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to validateExpression in 7a37593. Plan to evaluate removal based on trying to use the changes in accumulo. I believe this is used now in main, but not 100% sure.

* @param authorizations auths to use in the AccessEvaluator
* @return AccessEvaluator object
*/
AccessEvaluator newEvaluator(Authorizations authorizations);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this one is needed. It's basically a special case of the other, using a kind of "SetOfAuthorizationsAuthorizer"

Copy link
Contributor Author

@keith-turner keith-turner Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These need to be kept because there are internal optimization that can only be done when the whole set is known. These optimization avoid string allocations by transforming the set into an internal representation. This is much faster.

*
* </table>
*/
AccessEvaluator newEvaluator(Collection<Authorizations> authorizationSets);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If AccessEvaluator were a FunctionalInterface (which I think it can, and should, be), I think this basically becomes something like:

var allEvaluators = authorizationsSets.stream().map(this::newEvaluator).collect(Collectors.toList());
AccessEvaluator aggregateEvaluator = expression -> allEvaluators.stream().allMatch(AccessEvaluator::canAccess);

for (AccessExpression expression : allExpressionsToEvaluate) {
  if (aggregateEvaluator.canAccess(expression)) {
    // do whatever
  }
}

I think this is something a user can do themselves (using streams or traditional loops) if they had this use case. I think this is a useful example to document, but I don't think we need it in the main API entry point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method could be moved from the API to an example, opened #98 to do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If AccessEvaluator were a FunctionalInterface

Could be tricky to achieve that because it could lead to forcing double parsing (once for validation and once of evaluation) depending on how its done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this can be a Record and if there is any benefit to it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, it currently does not have equals(), hashCode(), or toString(). Also, it has a mutable field that is in atomic ref. It could be a record, but the atomic ref feels like it goes against the spirit of a record.

keith-turner added a commit to keith-turner/accumulo that referenced this pull request Jan 12, 2026
Updates to work with changes in apache/accumulo-access#96.

The new core/clientImpl/access/BytesAccess class consildates all the
code to work with the new APIs and handle byte[]->String correctly.
@keith-turner keith-turner marked this pull request as ready for review January 12, 2026 20:37
@keith-turner
Copy link
Contributor Author

Updated Accumulo to use these changes in apache/accumulo#6053 and took this out of draft.

@dlmarion
Copy link
Contributor

IMO we should merge this so that we can iterate on this and the changes in apache/accumulo#6053.

@keith-turner keith-turner merged commit 7049bc7 into apache:main Jan 12, 2026
6 checks passed
@keith-turner keith-turner deleted the aa-88 branch January 12, 2026 23:37
keith-turner added a commit to apache/accumulo that referenced this pull request Jan 15, 2026
Updates to work with changes in apache/accumulo-access#96.

The new core/clientImpl/access/BytesAccess class consildates all the
code to work with the new APIs and handle byte[]->String correctly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implementation does not match specification

3 participants