Modified accumulo access to only work with String and Unicode. #96

keith-turner · 2025-12-23T19:56:44Z

This is still a work in progress, but it's close to being complete. Modified the Accumulo Access code to only use String and Unicode. Made the following changes:

Removed all methods that take byte[] in the API
Removed the many static entry points in the API and replaced them w/ a single static entry point.
Reworked the internal code to operate on char instead of byte.
Limited authorizations to valid Unicode characters that are not ISO control characters by default. This is configurable, though.
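
The default character rule in the last item might look roughly like the sketch below. It is an illustration only, not the PR's code: the class and field names are hypothetical, and "valid" is assumed to mean `Character.isDefined`, which may differ from the check the PR actually uses.

```java
import java.util.function.Predicate;

// Hypothetical sketch; the class and field names are not taken from the PR.
final class DefaultAuthCheck {

  // Accepts an authorization only if every char is defined in Unicode and is
  // not an ISO control character. Assumption: "valid" is approximated here by
  // Character.isDefined; the PR may apply a different test.
  static final Predicate<CharSequence> DEFAULT = auth -> {
    for (int i = 0; i < auth.length(); i++) {
      char c = auth.charAt(i);
      if (!Character.isDefined(c) || Character.isISOControl(c)) {
        return false;
      }
    }
    return true;
  };

  public static void main(String[] args) {
    System.out.println(DEFAULT.test("na\u00EFve")); // letters only, no control chars
    System.out.println(DEFAULT.test("a\tb"));       // contains a tab (an ISO control char)
  }
}
```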

While working on this, realized it would be really nice if users could further limit authorizations to a smaller subset of characters. The existing Accumulo Access APIs had a ton of static entry points, and there was no one place where configurable behavior could be introduced. That led to reworking the API to have a single entry point where configuration could be applied instead of the many static entry points. Currently this entry point only allows setting a configurable authorization validator, but it makes it easy to support future configuration. This new single entry point is conceptually similar to how Gson works and was inspired by it.
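
To make the single-entry-point idea concrete, here is a minimal builder sketch in the Gson style. Every name in it (`AccessSketch`, `authorizationValidator`, `isValidAuthorization`) is hypothetical and chosen for illustration; it is not the API in this PR, only the shape the paragraph above describes.

```java
import java.util.Objects;
import java.util.function.Predicate;

// Hypothetical sketch of a single configurable entry point; not the PR's API.
final class AccessSketch {
  private final Predicate<String> authValidator;

  private AccessSketch(Predicate<String> authValidator) {
    this.authValidator = authValidator;
  }

  // The one static entry point; all configuration hangs off the builder.
  static Builder builder() {
    return new Builder();
  }

  boolean isValidAuthorization(String auth) {
    return authValidator.test(auth);
  }

  static final class Builder {
    // Assumption for the sketch: accept everything unless configured otherwise.
    private Predicate<String> authValidator = auth -> true;

    Builder authorizationValidator(Predicate<String> validator) {
      this.authValidator = Objects.requireNonNull(validator);
      return this;
    }

    AccessSketch build() {
      return new AccessSketch(authValidator);
    }
  }

  public static void main(String[] args) {
    // A user-supplied rule that narrows authorizations to ASCII letters and digits.
    AccessSketch access = AccessSketch.builder()
        .authorizationValidator(
            auth -> auth.chars().allMatch(c -> c < 128 && Character.isLetterOrDigit(c)))
        .build();
    System.out.println(access.isValidAuthorization("ALPHA1"));     // true
    System.out.println(access.isValidAuthorization("na\u00EFve")); // false
  }
}
```

Because the configured instance carries the validator, later options can be added to the builder without introducing new static methods.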

Still puzzling about how Accumulo might use this code or if that is possible. Want to experiment w/ that some.

Have not yet run the benchmarks w/ these code changes.

fixes #88

dlmarion · 2026-01-02T14:19:58Z

Do we still need the quoting and unquoting of terms in an AccessExpression if the goal is String and Unicode only?

keith-turner · 2026-01-05T23:44:33Z

> Do we still need the quoting and unquoting of terms in an AccessExpression if the goal is String and Unicode only?

Yeah, they are still needed. Quoting allows auths to use characters that are reserved for the language itself. It also allows the language to adopt new characters in the future, even ones that are already used inside quoted expressions.
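
A rough sketch of why quoting matters: an authorization containing a reserved character such as `&` can only appear in an expression if it is quoted. The code below is an illustration under assumptions, not the library's implementation. It assumes the unquoted character set is ASCII letters, digits, and `_ - . : /`, and that `"` and `\` are backslash-escaped inside quotes; the class and method names are hypothetical.

```java
// Hypothetical sketch of quoting and unquoting an authorization.
final class QuoteSketch {

  // Assumption: characters that may appear in an authorization without quotes.
  static boolean isUnquotedChar(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
        || c == '_' || c == '-' || c == '.' || c == ':' || c == '/';
  }

  // Wraps the authorization in double quotes only when it needs them.
  static String quote(String auth) {
    boolean needsQuotes = false;
    for (int i = 0; i < auth.length() && !needsQuotes; i++) {
      needsQuotes = !isUnquotedChar(auth.charAt(i));
    }
    if (!needsQuotes) {
      return auth;
    }
    StringBuilder sb = new StringBuilder(auth.length() + 2).append('"');
    for (char c : auth.toCharArray()) {
      if (c == '"' || c == '\\') {
        sb.append('\\'); // escape the quote and the escape character itself
      }
      sb.append(c);
    }
    return sb.append('"').toString();
  }

  // Reverses quote(); text that is not wrapped in quotes is returned unchanged.
  static String unquote(String term) {
    int last = term.length() - 1;
    if (term.length() < 2 || term.charAt(0) != '"' || term.charAt(last) != '"') {
      return term;
    }
    StringBuilder sb = new StringBuilder(term.length());
    for (int i = 1; i < last; i++) {
      char c = term.charAt(i);
      if (c == '\\' && i + 1 < last) {
        c = term.charAt(++i); // take the escaped character literally
      }
      sb.append(c);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(quote("ABC"));          // no reserved characters, left as-is
    System.out.println(quote("A&B"));          // '&' is an operator, so quotes are added
    System.out.println(unquote(quote("A&B"))); // round-trips to the original
  }
}
```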

keith-turner · 2026-01-07T00:28:01Z

Ran the benchmark against commit dd2ef2d from this PR and saw the following.

AccessExpressionBenchmark.measureBytesValidation           thrpt   12  14.172 ± 0.031  ops/us
AccessExpressionBenchmark.measureLegacyEvaluationOnly      thrpt   12  22.322 ± 0.150  ops/us
AccessExpressionBenchmark.measureLegacyParseAndEvaluation  thrpt   12   9.637 ± 0.384  ops/us
AccessExpressionBenchmark.measureParseAndEvaluation        thrpt   12  10.821 ± 0.112  ops/us
AccessExpressionBenchmark.measureStringValidation          thrpt   12  17.741 ± 0.429  ops/us

Then ran against c70d418 from main and saw the following.

AccessExpressionBenchmark.measureBytesValidation           thrpt   12  35.619 ± 2.259  ops/us
AccessExpressionBenchmark.measureLegacyEvaluationOnly      thrpt   12  22.444 ± 0.409  ops/us
AccessExpressionBenchmark.measureLegacyParseAndEvaluation  thrpt   12   9.495 ± 0.021  ops/us
AccessExpressionBenchmark.measureParseAndEvaluation        thrpt   12  16.947 ± 0.328  ops/us
AccessExpressionBenchmark.measureStringValidation          thrpt   12  28.937 ± 0.013  ops/us

Improved the benchmark times with the changes in 22af52d. These changes use a char[] array instead of a String internally to avoid calling charAt() on the String so frequently.
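
As an illustration of the char[] approach (not the code in 22af52d), a scanner can copy the String into an array once with `toCharArray()` and then index the array directly. The class below is a hypothetical sketch; its names and its simplified set of unquoted authorization characters are assumptions.

```java
// Hypothetical sketch: scan an expression through a char[] copied once up front.
final class CharArrayScan {
  private final char[] input; // single copy of the String's chars
  private int pos;

  CharArrayScan(String expression) {
    this.input = expression.toCharArray();
  }

  // Assumption: a simplified set of characters allowed in an unquoted authorization.
  private static boolean isAuthChar(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
        || c == '_' || c == '-' || c == '.' || c == ':' || c == '/';
  }

  boolean hasNext() {
    return pos < input.length;
  }

  // Returns the char at the current position and advances past it.
  char next() {
    return input[pos++];
  }

  // Consumes a run of unquoted authorization characters using array indexing
  // only, and returns it (possibly empty).
  String nextUnquoted() {
    int start = pos;
    while (pos < input.length && isAuthChar(input[pos])) {
      pos++;
    }
    return new String(input, start, pos - start);
  }

  public static void main(String[] args) {
    CharArrayScan scan = new CharArrayScan("A&B");
    System.out.println(scan.nextUnquoted()); // A
    System.out.println(scan.next());         // &
    System.out.println(scan.nextUnquoted()); // B
  }
}
```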

AccessExpressionBenchmark.measureBytesValidation           thrpt   12  18.068 ± 0.386  ops/us
AccessExpressionBenchmark.measureLegacyEvaluationOnly      thrpt   12  22.262 ± 0.165  ops/us
AccessExpressionBenchmark.measureLegacyParseAndEvaluation  thrpt   12   9.413 ± 0.164  ops/us
AccessExpressionBenchmark.measureParseAndEvaluation        thrpt   12  12.825 ± 0.253  ops/us
AccessExpressionBenchmark.measureStringValidation          thrpt   12  22.166 ± 1.012  ops/us

Experimented w/ adding char[] methods to the public API (in addition to the String methods), and those are faster when you already have a char array. However, using the Java CharsetDecoder class to create a char[] is much slower than creating a String, so it's not a net win.
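
For reference, the two ways of getting from UTF-8 bytes to a char[] being compared might look like the sketch below, assuming the decoder in question is `java.nio.charset.CharsetDecoder`. The class and method names are hypothetical, and the sketch only shows that both paths yield the same chars for valid input; it makes no timing claim of its own.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of two routes from UTF-8 bytes to a char[].
final class DecodeSketch {

  // Route 1: decode directly with a CharsetDecoder. A new decoder reports
  // malformed input by throwing, which is rethrown here as unchecked.
  static char[] viaDecoder(byte[] utf8) {
    try {
      CharBuffer chars = StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(utf8));
      char[] out = new char[chars.remaining()];
      chars.get(out);
      return out;
    } catch (CharacterCodingException e) {
      throw new IllegalArgumentException("input is not valid UTF-8", e);
    }
  }

  // Route 2: build a String first, then copy out its chars. Note that this
  // constructor replaces malformed input instead of throwing.
  static char[] viaString(byte[] utf8) {
    return new String(utf8, StandardCharsets.UTF_8).toCharArray();
  }

  public static void main(String[] args) {
    byte[] bytes = "A&\"na\u00EFve\"".getBytes(StandardCharsets.UTF_8);
    System.out.println(Arrays.equals(viaDecoder(bytes), viaString(bytes)));
  }
}
```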

ctubbsii · 2026-01-07T18:42:02Z

core/src/main/java/org/apache/accumulo/access/AuthorizationValidator.java

+    QUOTED,
+    /**
+     * Denotes that an authorization seen in a valid access expression was unquoted. This means the
+     * expression only contains the characters allowed in an unquoted authorization.
+     */
+    UNQUOTED


In the case where the predicate is used to evaluate a newly constructed Authorizations, each individual authorization string in it will never be quoted, so the concept of quoted/unquoted doesn't make sense for that case. Further, if you assume unquoted and always pass that enum for that case, it prevents a user from making a more restricted Predicate that ensures all authorizations are quoted or that all of them are unquoted. A user may want such a restricted predicate to normalize their AccessExpressions, because the quotes still affect the lexical ordering of the expressions for keys stored in Accumulo, as well as the efficiency of the evaluator cache (there could be different cache entries for equivalent access expressions that differ only by optional quotes).
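
A small sketch of the normalization concern: two expressions that differ only by optional quotes are different strings, so they sort differently and occupy separate entries in any cache keyed on the expression text. The class and method names are hypothetical, and the cache is stood in for by a plain `HashSet`.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical sketch: optional quotes change both cache keys and sort order.
final class QuoteOrderingSketch {

  // Number of distinct keys a cache keyed on the raw expression text would hold.
  static int distinctCacheKeys(List<String> expressions) {
    return new HashSet<>(expressions).size();
  }

  public static void main(String[] args) {
    String plain = "A&B";
    String quoted = "\"A\"&B"; // same meaning, optional quotes around A

    // Two keys for what is logically one expression.
    System.out.println(distinctCacheKeys(Arrays.asList(plain, quoted))); // 2

    // '"' (0x22) sorts before 'A' (0x41), so the quoted form orders first.
    System.out.println(quoted.compareTo(plain) < 0); // true
  }
}
```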

> In the case where the predicate is used to evaluate a newly constructed Authorizations

Improved the naming in f8492ab for this case.

ctubbsii · 2026-01-07T18:45:21Z

core/src/main/java/org/apache/accumulo/access/AccumuloAccess.java

+ * @see #builder()
+ * @since 1.0
+ */
+public interface AccumuloAccess {


I don't think we need "Accumulo" prefix on this object. While Accumulo is the containing project, I think this class could just be called "Access".

Made that change in 778c9e4. I wanted a shorter name for this class but was not sure what name to use. Access seems good to me.

ctubbsii · 2026-01-07T18:54:30Z