Command Line Interface for UNF#10

Open
kulnor wants to merge 13 commits into IQSS:master from kulnor:master
Conversation

@kulnor kulnor commented Mar 17, 2026

I have recently implemented a UNF Python package and, to compare/validate the outputs, added a command-line utility to this Dataverse / Java implementation:

java -cp target/unf-6.0.2-SNAPSHOT.jar org.dataverse.unf.UnfCli --input <path> [options]

It outputs a comprehensive JSON document that includes file- and variable-level UNFs, along with options for the algorithm and the tool (see the test directory for simple examples).

Along the way, I have also generated a technical overview and a contributor guide. Note that this was mostly vibe-coded and should not affect the current code (it's an add-on).

FYI, I discussed this by email with Micah and Leo.

kulnor added 12 commits March 9, 2026 13:41
- Introduced a new CLI entry point in UnfCli.java for processing input files and generating JSON reports.
- Updated README.md with CLI usage instructions and examples.
- Added unf6_schema.json to define the schema for UNF v6 calculation results.
- Created UnfCliTest.java to validate the CLI functionality with unit tests.
- Exposed UnfCli.generateReport and CliOptions as public for programmatic usage.
- Added documentation for programmatic usage to README.md.
- Renamed doc/unf6.schema.json to doc/unf.schema.json and updated all references.
Copilot AI review requested due to automatic review settings March 17, 2026 17:30

Copilot AI left a comment

Pull request overview

Adds a Java command-line interface to generate UNF v6 reports as JSON (file- and dataset-level), plus schema/docs and test fixtures to validate outputs and support cross-implementation comparisons (e.g., with dartfx-unf Python).

Changes:

  • Introduce org.dataverse.unf.UnfCli to compute UNFs for single files (line-based and CSV/TSV) and directories (dataset-level).
  • Add JUnit tests and dartfx CSV/JSON fixtures for validating computed UNFs.
  • Add JSON schema + documentation (README CLI usage, technical overview, contributor guide) and update .gitignore.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| src/main/java/org/dataverse/unf/UnfCli.java | New CLI + report generator (JSON output, tabular parsing, dataset aggregation). |
| src/test/java/org/dataverse/unf/UnfCliTest.java | Tests for CLI report generation on temp line/text and CSV inputs. |
| src/test/java/org/dataverse/unf/UnfDartfxTest.java | Tests validating known file-level UNFs against dartfx example CSVs. |
| src/test/resources/test/dartfx/101A.csv | Dartfx sample CSV fixture. |
| src/test/resources/test/dartfx/101A.unf.json | Expected JSON report fixture for 101A. |
| src/test/resources/test/dartfx/101B.csv | Dartfx sample CSV fixture. |
| src/test/resources/test/dartfx/101B.unf.json | Expected JSON report fixture for 101B. |
| src/test/resources/test/dartfx/101C.csv | Dartfx sample CSV fixture. |
| src/test/resources/test/dartfx/101C.unf.json | Expected JSON report fixture for 101C. |
| src/test/resources/test/dartfx/101D.csv | Dartfx sample CSV fixture. |
| src/test/resources/test/dartfx/101D.unf.json | Expected JSON report fixture for 101D. |
| doc/unf.schema.json | New JSON Schema documenting the UNF report shape. |
| doc/TECHNICAL_OVERVIEW.md | Technical architecture overview of the UNF implementation. |
| doc/CONTRIBUTOR_GUIDE.md | Contributor guidance emphasizing compatibility/stability and testing. |
| README.md | Adds CLI usage docs and programmatic usage examples. |
| .gitignore | Adds OS/editor/IDE ignores. |

Comment on lines +74 to +77
List<Path> files = Files.list(inputPath)
.filter(Files::isRegularFile)
.sorted(Comparator.comparing(Path::getFileName))
.toList();
fileUnfs.add(entry.unf);
}
String datasetUnf = UNFUtil.calculateUNF(fileUnfs.toArray(new String[0]));
resultJson = datasetResultJson(inputPath.getFileName().toString(), datasetUnf, entries);
Comment on lines +37 to +38
private static final String SOFTWARE_VERSION = "6.0.2-SNAPSHOT";

Comment on lines +287 to +289
+ "\"N\":7,"
+ "\"X\":128,"
+ "\"H\":128,"
"version": {
"type": "string"
}
}
Comment on lines +532 to +535
case "--has-header":
options.hasHeader = Boolean.parseBoolean(requireValue(args, ++i, arg));
break;
case "--column-types":
Comment on lines +72 to +94
if (Files.isDirectory(inputPath)) {
List<FileResult> entries = new ArrayList<>();
List<Path> files = Files.list(inputPath)
.filter(Files::isRegularFile)
.sorted(Comparator.comparing(Path::getFileName))
.toList();

if (files.isEmpty()) {
throw new IllegalArgumentException("Input directory has no regular files: " + inputPath);
}

List<String> fileUnfs = new ArrayList<>();
for (Path file : files) {
FileResult entry = computeFileResult(file, options);
entries.add(entry);
fileUnfs.add(entry.unf);
}
String datasetUnf = UNFUtil.calculateUNF(fileUnfs.toArray(new String[0]));
resultJson = datasetResultJson(inputPath.getFileName().toString(), datasetUnf, entries);
} else {
FileResult fileResult = computeFileResult(inputPath, options);
resultJson = fileResult.toJson();
}
Comment on lines +233 to +241
case DATETIME:
if (datetimeFormat == null || datetimeFormat.isBlank()) {
throw new IllegalArgumentException("--datetime-format is required for type datetime.");
}
String[] rows = values.toArray(new String[0]);
String[] patterns = new String[rows.length];
Arrays.fill(patterns, datetimeFormat);
return UNFUtil.calculateUNF(rows, patterns);
default:
Comment on lines +31 to +47
Path tempFile = Files.createTempFile("unf-cli-string", ".txt");
Files.writeString(tempFile, "Hello World\nTesting 123\n", StandardCharsets.UTF_8);

UnfCli.CliOptions options = new UnfCli.CliOptions().withInput(tempFile.toString()).withType("string");
String json = UnfCli.generateReport(tempFile, options);

assertTrue(json.contains("\"unf_version\":\"6\""));
assertTrue(json.contains("\"type\":\"file\""));
assertTrue(json.contains("\"columns\""));
assertTrue(json.contains("\"unf\":\"UNF:6:r+FDbVC6fKdUjRS6ZIzP4w==\""));
}

@Test
void generateReport_csvFile_withTwoNumericColumns_returnsFileAndColumnUNFs() throws Exception {
Path tempFile = Files.createTempFile("unf-cli-table", ".csv");
Files.writeString(tempFile, "a,b\n6.6666666666666667,32\n75.216,2024\n", StandardCharsets.UTF_8);

4. Truncates to the most significant 128 bits.
5. Encodes as Base64 and prefixes with `UNF:<version>[:extensions]:`.

Current version string is `6` (`UnfDigest.currentVersion`).
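
Steps 4 and 5 of the overview excerpt above can be illustrated with a small, self-contained sketch. It is not the project's code: `UnfTailSketch` and `formatDigest` are hypothetical names, SHA-256 is assumed as the hash (the excerpt does not name it), and the input bytes are a placeholder rather than a properly normalized vector, so the printed value is not a valid UNF for any real data.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Base64;

// Illustrative only: these names are hypothetical and not part of org.dataverse.unf.
final class UnfTailSketch {

    // Hash the (already normalized) bytes, keep the leading 128 bits,
    // Base64-encode them, and add the "UNF:<version>:" prefix.
    static String formatDigest(byte[] normalizedBytes, String version) {
        try {
            // Assumption: SHA-256 is the underlying hash.
            byte[] hash = MessageDigest.getInstance("SHA-256").digest(normalizedBytes);
            // The most significant 128 bits are the first 16 bytes of the digest.
            byte[] truncated = Arrays.copyOf(hash, 16);
            return "UNF:" + version + ":" + Base64.getEncoder().encodeToString(truncated);
        } catch (NoSuchAlgorithmException ex) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // Placeholder input; real UNF input comes from the normalization steps.
        byte[] sample = "placeholder".getBytes(StandardCharsets.UTF_8);
        System.out.println(formatDigest(sample, "6"));
    }
}
```

Sixteen bytes always encode to 24 Base64 characters ending in `==`, which matches the shape of the `UNF:6:...==` value asserted in the test above.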
@pdurbin pdurbin moved this to Ready for Triage in IQSS Dataverse Project Mar 17, 2026
Author

kulnor commented Mar 17, 2026

Looks like Copilot has a few good suggestions. Let me know if you need me to resolve.

@landreev landreev self-assigned this Apr 2, 2026
Contributor

landreev commented Apr 2, 2026

@kulnor Thanks again for the PR.
I'm playing/experimenting with the CLI now. I'm encouraged to see that it's generally doing a good job detecting or guessing data types of individual columns. For example, in the test case we discussed in the email thread:

var1,var2,var3
1,1,2
2,,2
3,3,3
4,4,4

It properly identifies var1 and var3 as numeric, passes these vectors as Java arrays of the correct type to the UNFUtil proper, and gets the correct signatures, identical to the ones produced by Dataverse.
The problem with the second column, var2, should therefore be solvable, hopefully by some simple tweaks to the logic the CLI uses to make these educated guesses.

From a very quick look, the issue is likely this, and other similar methods: https://github.com/kulnor/UNF-dataverse/blob/master/src/main/java/org/dataverse/unf/UnfCli.java#L659-L666 - which just need to be adjusted so that they do not jump to conclusions prematurely.

Overall I am quite excited about having this interface added; it will give us a simple but potentially very useful standalone tool, as an extra way to calculate the UNFs outside of Dataverse.

Author

kulnor commented Apr 2, 2026

Glad this is useful and that it may lead to a patch for type detection. I can also keep the generated JSON aligned with the one produced by the Python package (it is aligned now, but in case I adjust it later). We're starting to find other use cases for UNF, so we hope this helps grow adoption.

Comment on lines +659 to +666
private static boolean isLong(String value) {
try {
Long.parseLong(value);
return true;
} catch (NumberFormatException ex) {
return false;
}
}
Contributor

@landreev landreev Apr 2, 2026

It looks like we just need to modify this and similar methods to be less rigid: a value being an empty string should not be sufficient to assume that it is NOT a Long, Int, etc. The final decision should be made only on the combined vector, based on the types of the non-empty values in it, as long as some are present.
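
A minimal sketch of that idea, for illustration only. The names (`BlankAwareInference`, `Kind`, `infer`) are hypothetical and do not mirror `UnfCli.java`, and the fallback for an all-blank column is an assumption. Blanks are skipped, and the type is decided from the non-blank values of the whole column.

```java
import java.util.List;

// Hypothetical sketch; class, enum, and method names are not taken from UnfCli.java.
final class BlankAwareInference {

    enum Kind { LONG, DOUBLE, STRING }

    static boolean isBlank(String v) {
        return v == null || v.trim().isEmpty();
    }

    static boolean isLong(String v) {
        try {
            Long.parseLong(v.trim());
            return true;
        } catch (NumberFormatException ex) {
            return false;
        }
    }

    static boolean isDouble(String v) {
        try {
            // Note: parseDouble also accepts forms such as "NaN" and "1e3".
            Double.parseDouble(v.trim());
            return true;
        } catch (NumberFormatException ex) {
            return false;
        }
    }

    // Decide the column type from the non-blank values only.
    static Kind infer(List<String> column) {
        boolean sawValue = false;
        boolean allLong = true;
        boolean allDouble = true;
        for (String v : column) {
            if (isBlank(v)) {
                continue; // a blank is a missing value, not evidence about the type
            }
            sawValue = true;
            allLong &= isLong(v);
            allDouble &= isDouble(v);
        }
        if (!sawValue) {
            return Kind.STRING; // assumption: an all-blank column falls back to string
        }
        if (allLong) {
            return Kind.LONG;
        }
        return allDouble ? Kind.DOUBLE : Kind.STRING;
    }

    public static void main(String[] args) {
        // var2 from the test case in this thread: 1, (blank), 3, 4
        System.out.println(infer(List.of("1", "", "3", "4"))); // prints LONG
    }
}
```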

Contributor

landreev commented Apr 2, 2026

I will look at the Copilot suggestions carefully later on. (my own experience with it has been a mixed bag, in terms of the quality of its advice)

Contributor

landreev commented Apr 3, 2026

> Glad this is useful and that it may lead to a patch for type detection.

Sorry if I'm just repeating the obvious, but just to put it on record, we are not talking about a patch for the Java UNF implementation as it exists now. This would involve some simple fixes in this PR - in the code of your CLI proper. Once again, the UNFUtil is just a dumb calculator that works on the vectors that are passed to it. We just need to fix the CLI so that it doesn't treat numeric vectors with missing values as String arrays.

I'm happy to contribute the fixes. Just trying to decide how to go about it. As you saw, the repo didn't even have a develop branch, which says something about how abandoned this project was. I may merge your PR into a dedicated local branch and take it from there. I can give you write access to it too - but no pressure, obvs., if you are working on other things now.

Author

kulnor commented Apr 3, 2026

Oh my, I actually didn't see the obvious: I created the issue. I was totally convinced that this inference was happening outside the CLI code. I'll look into this asap and amend the PR.

Author

kulnor commented Apr 3, 2026

Just pushed the changes below. Made sure it didn't touch any other files (only UnfCli.java and README). Added a couple of related tests.


To support missing values (blanks/nulls) in CSV files without modifying the core UNFUtil library, I implemented three key changes in UnfCli.java:

  1. Selective Type Inference

     The ColumnType.infer method was updated to be "blank-aware." It now uses a helper (all) that skips empty or whitespace-only strings during inference.

     • Before: A column with 1,,2 would fail the isInt check because of the middle blank, forcing the column to be treated as a string.
     • After: The check only applies to non-blank values. If 1 and 2 are valid integers and the rest are blanks, the column is correctly inferred as INT (numeric).

  2. Transition to Boxed Arrays (Object Arrays)

     All conversion methods (e.g., toIntArray, toDoubleArray, etc.) were refactored to return object arrays (like Integer[] or Double[]) instead of primitive arrays (like int[]).

     • Handling Blanks: When a blank string is encountered, instead of throwing a NumberFormatException, the converter now inserts a null at that position in the array.
     • Why this matters: Reference types allow us to represent a "missing value" as null, which the underlying UNF hashing engine understands as a missing entry (isna).

  3. Integrated UNF Vector Calculation

     The calculateColumnUnf method was updated to route these new arrays into the appropriate calculation paths:

     • Numeric Types (INT, DOUBLE, etc.): I used the existing UNFUtil.calculateUNF(Number[]) overload. This method is robust enough to handle Number[] (which our boxed arrays satisfy) and maps null values to Double.NaN, allowing for the correct "missing value" hashing.
     • Boolean Type: Since UNFUtil only had a primitive boolean[] overload (which cannot hold nulls), I bypassed it and called the public UnfDigest.unf(Boolean[]) directly. This ensures boolean columns with gaps are hashed accurately according to the UNF specification.
     • String/BitString: These were already using object arrays (String[] / BitString[]) and already supported null values via the existing UNFUtil methods.

These changes ensure the tool remains compatible with the existing UNFUtil library while providing the necessary flexibility to handle sparse real-world data.
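
As a rough illustration of the boxed-array conversion in item 2, here is a self-contained sketch. `BoxedConversionSketch` and this `toLongArray` are hypothetical stand-ins, not the code in `UnfCli.java`, and the sketch stops before the array is handed to `UNFUtil`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a blank-tolerant converter; not the actual UnfCli code.
final class BoxedConversionSketch {

    // Convert raw cell strings to a boxed array, mapping blanks to null
    // so that missing values survive as "missing" rather than failing to parse.
    static Long[] toLongArray(List<String> values) {
        Long[] out = new Long[values.size()];
        for (int i = 0; i < out.length; i++) {
            String v = values.get(i);
            boolean blank = v == null || v.trim().isEmpty();
            out[i] = blank ? null : Long.valueOf(v.trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // var2 from the test case in this thread: 1, (blank), 3, 4
        Long[] column = toLongArray(List.of("1", "", "3", "4"));
        System.out.println(Arrays.toString(column)); // prints [1, null, 3, 4]
    }
}
```

Because `Long[]` is a `Number[]` through array covariance, an array like this could be passed to a method that takes `Number[]`, which is the route the comment above describes for numeric columns.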

Contributor

landreev commented Apr 8, 2026

This is great, thank you!
I will be away for the rest of this week, but my plan is to wrap up this PR first thing next week.

Author

kulnor commented Apr 8, 2026

Thanks, and no rush. Apologies again for missing the issue and for the confusion I caused. 🥲
