14 changes: 14 additions & 0 deletions .gitignore
@@ -12,3 +12,17 @@
hs_err_pid*
.flattened-pom.xml
target/

# OS and editor artifacts
.DS_Store
.AppleDouble
.LSOverride
Icon?
._*
.Spotlight-V100
.Trashes

# Local IDE/editor settings
.vscode/
.idea/
*.iml
155 changes: 155 additions & 0 deletions README.md
@@ -8,6 +8,161 @@ The Universal Numerical Fingerprint (UNF) is a cryptographic signature of the ap
It is computed on the normalized (or canonicalized) forms of the data values, and is thus independent of the storage
medium and format of the data.

CLI Usage
---------

This package includes a CLI entry point that reads supported input files and emits a JSON report that follows `doc/unf.schema.json`.

Build:

```bash
mvn -q -DskipUT=true package
```

Run from terminal (quick start):

```bash
cd /path/to/UNF-dataverse
mvn -q -DskipUT=true package
java -cp target/unf-<version>-SNAPSHOT.jar org.dataverse.unf.UnfCli --help
java -cp target/unf-<version>-SNAPSHOT.jar org.dataverse.unf.UnfCli --input <path>
```

CLI Entry Point:

```bash
java -cp target/unf-6.0.2-SNAPSHOT.jar org.dataverse.unf.UnfCli --input <path> [options]
```

Options:

- `--input <path>`: required. File or directory to process.
- `--output <path>`: optional. Writes JSON report to file. If omitted, prints to stdout.
- `--type <name>`: optional for non-CSV/TSV files. Default is `string`.
- `--datetime-format <fmt>`: required when `datetime` type is used.
- `--delimiter <char>`: optional delimiter override for tabular files.
- `--has-header <true|false>`: optional for tabular files. Default is `true`.
- `--column-types <list>`: optional comma-separated type list for CSV/TSV columns.
- `--help`: prints usage.

Supported Types:

- `string`
- `double`
- `float`
- `short`
- `byte`
- `long`
- `int`
- `boolean`
- `bitstring`
- `datetime`

Note: In the generated JSON output, all numeric types (`double`, `float`, `short`, `byte`, `long`, `int`) are reported as `"numeric"` because UNF calculations treat them identically (they are converted to doubles internally).

Examples:

Single-column text file (one value per line):

```bash
java -cp target/unf-6.0.2-SNAPSHOT.jar org.dataverse.unf.UnfCli \
--input /path/to/values.txt \
--type string
```

Date/time text file:

```bash
java -cp target/unf-6.0.2-SNAPSHOT.jar org.dataverse.unf.UnfCli \
--input /path/to/timestamps.txt \
--type datetime \
--datetime-format "yyyy-MM-dd'T'HH:mm:ss"
```

CSV with inferred column types:

```bash
java -cp target/unf-6.0.2-SNAPSHOT.jar org.dataverse.unf.UnfCli \
--input /path/to/data.csv
```

CSV with explicit per-column types:

```bash
java -cp target/unf-6.0.2-SNAPSHOT.jar org.dataverse.unf.UnfCli \
--input /path/to/data.csv \
--column-types int,int,string,double
```

TSV without header row:

```bash
java -cp target/unf-6.0.2-SNAPSHOT.jar org.dataverse.unf.UnfCli \
--input /path/to/data.tsv \
--has-header false
```

Force custom delimiter:

```bash
java -cp target/unf-6.0.2-SNAPSHOT.jar org.dataverse.unf.UnfCli \
--input /path/to/data.csv \
--delimiter ';'
```

Write output JSON to a file:

```bash
java -cp target/unf-6.0.2-SNAPSHOT.jar org.dataverse.unf.UnfCli \
--input /path/to/data.csv \
--output /path/to/report.unf.json
```

Dataset-level report from directory (regular files only, sorted by filename):

```bash
java -cp target/unf-6.0.2-SNAPSHOT.jar org.dataverse.unf.UnfCli \
--input /path/to/dataset-dir
```

Operational Notes:

- For CSV/TSV, if `--column-types` is omitted, type inference is strict: a column is assigned a type only if every value in it parses as that type.
- Empty cells are not auto-converted to typed missing values during CSV/TSV parsing, so a column that contains blanks may be inferred as `string`.
- If a numeric column contains blanks and you force a numeric type with `--column-types`, parsing fails.
- `--column-types` must include exactly one type per column.
- CSV/TSV parsing splits each line on the delimiter and does not implement quoted-field escaping, so a quoted value that contains the delimiter is split into separate fields.

Show usage:

```bash
java -cp target/unf-6.0.2-SNAPSHOT.jar org.dataverse.unf.UnfCli --help
```
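
Inference and splitting sketch:

The strict-inference and delimiter-split notes above can be illustrated with a small standalone sketch. It is not the CLI's implementation: the class name `InferenceSketch`, its method names, and the candidate order (`int`, then `double`, then `string`) are assumptions made for illustration only.

```java
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: NOT the UnfCli implementation. All names and the
// int -> double -> string candidate order are assumptions.
final class InferenceSketch {

    // Strict inference: a type is chosen only if EVERY value parses as it.
    static String inferType(List<String> column) {
        if (column.stream().allMatch(InferenceSketch::isInt)) {
            return "int";
        }
        if (column.stream().allMatch(InferenceSketch::isDouble)) {
            return "double";
        }
        return "string";
    }

    static boolean isInt(String s) {
        try {
            Integer.parseInt(s);
            return true;
        } catch (NumberFormatException e) {
            return false; // includes the empty string
        }
    }

    static boolean isDouble(String s) {
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false; // includes the empty string
        }
    }

    // Plain delimiter split with no quote handling.
    // The -1 limit keeps trailing empty cells.
    static String[] splitLine(String line, char delimiter) {
        return line.split(Pattern.quote(String.valueOf(delimiter)), -1);
    }

    public static void main(String[] args) {
        System.out.println(inferType(List.of("1", "2", "3")));    // int
        System.out.println(inferType(List.of("1", "2.5", "3")));  // double
        System.out.println(inferType(List.of("1", "", "3")));     // string
        System.out.println(splitLine("a,\"b,c\",d", ',').length); // 4
    }
}
```

A single blank cell is enough to push the whole column to `string` in this sketch, and the quoted value `"b,c"` comes back as two fields, which matches the behavior the notes describe.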

Programmatic Usage
------------------

The UNF reporting logic can be used as a library by other Java applications.

```java
import org.dataverse.unf.UnfCli;
import java.nio.file.Path;

// 1. Configure options programmatically
UnfCli.CliOptions options = new UnfCli.CliOptions()
.withInput("data.csv")
.withType("string");

// (Optional) Configure advanced settings
options.hasHeader = true;
options.columnTypes = "int,string,double";

// 2. Generate the JSON report directly
String json = UnfCli.generateReport(Path.of("data.csv"), options);

// 3. (Optional) Use the built-in pretty printer
String prettyJson = UnfCli.prettyJson(json);
System.out.println(prettyJson);
```

License
-------

132 changes: 132 additions & 0 deletions doc/CONTRIBUTOR_GUIDE.md
@@ -0,0 +1,132 @@
# UNF Dataverse Java Package: Contributor Guide

## Scope

This guide explains how to contribute safely to the UNF Java package while preserving output stability.

UNF behavior is sensitive: even small canonicalization changes can alter signatures and break compatibility.

## Project Layout

- `src/main/java/org/dataverse/unf/`: implementation
- `src/test/java/org/dataverse/unf/UNF6UtilTest.java`: fixture-driven unit tests
- `src/test/resources/test/`: expected-output fixtures by data type
- `doc/`: package documentation and examples

## Local Prerequisites

- JDK 17 (the version declared in `pom.xml`)
- Maven 3.x

Typical commands:
```bash
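mvn clean test   # remove previous build output, then compile and run all tests
mvn test         # compile what changed and run all tests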
```

To run a single test class:
```bash
mvn -Dtest=UNF6UtilTest test
```

## Recommended Contribution Workflow

1. Pick a target area (API overload, canonicalization logic, utility behavior, or tests).
2. Read the existing tests and the corresponding fixture files first.
3. Make the smallest change that works, and state explicitly whether it is meant to preserve existing UNF output.
4. Run the full test suite (`mvn test`).
5. If a behavior change is intentional, update the expected values in the fixtures and document the rationale in the PR notes.

## How the Package Works (Contributor View)

### Public API and entry points

Most external callers use `UNFUtil.calculateUNF(...)` overloads.
`UNFUtil` performs input adaptation and delegates to `UnfDigest`.

### Digest orchestration

`UnfDigest`:
- routes data to type-specific handlers,
- controls matrix orientation (`trnps`),
- prefixes output with `UNF:<version>...`,
- combines multiple UNFs via `addUNFs(...)`.

### Canonicalization engines

- Numeric: `UnfNumber` + `RoundRoutines`
- String: `UnfString` + `RoundRoutines` / `RoundString`
- Boolean: `UnfBoolean`
- Bitfield: `UnfBitfield` + `BitString`
- Date/time: `UNFUtil` overloads + `UnfDateFormatter`

### Hash + encoding

All handlers eventually:
- feed canonical bytes into `SHA-256`,
- truncate to 128 bits,
- Base64-encode,
- return with UNF prefix.
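
The four steps above can be sketched with JDK classes alone. This is a minimal illustration, not the package's code: the class name `DigestSketch`, the choice of the first 16 bytes as the retained 128 bits, and the literal `"UNF:6:"` prefix used in `main` are assumptions, and the canonical byte layout fed into the hash is left out entirely.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Base64;

// Illustrative only: NOT the UnfDigest implementation.
final class DigestSketch {

    static String fingerprint(byte[] canonicalBytes, String versionPrefix) {
        try {
            // SHA-256 always yields 32 bytes.
            byte[] full = MessageDigest.getInstance("SHA-256").digest(canonicalBytes);
            // Assumption: "truncate to 128 bits" keeps the first 16 bytes.
            byte[] truncated = Arrays.copyOf(full, 16);
            // 16 bytes encode to 24 Base64 characters, the last two being padding.
            return versionPrefix + Base64.getEncoder().encodeToString(truncated);
        } catch (NoSuchAlgorithmException e) {
            // Every conforming Java platform must provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder input; real canonical bytes come from the type handlers.
        byte[] bytes = "example".getBytes(StandardCharsets.UTF_8);
        System.out.println(fingerprint(bytes, "UNF:6:"));
    }
}
```

Changing any one of the three steps (hash algorithm, truncation length, or encoding) changes every emitted value, which is why this path is listed as compatibility-sensitive.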

## High-Risk Change Areas

Treat these as compatibility-sensitive:

- `RoundRoutines` and `RoundString` formatting logic
- missing/null sentinel handling (`UnfCons.missv`, null-byte behavior)
- date/time normalization and timezone treatment
- digest truncation size and Base64 conversion path
- sorting/combining logic in `UnfDigest.addUNFs(...)`

Any change in these areas can alter emitted UNFs for existing data.
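
As one illustration of that sensitivity, the sketch below combines a list of strings with and without sorting them first. It shows only the general idea that sorting makes a combined digest independent of input order. The class `CombineSketch`, the newline separator, and the untruncated Base64 output are assumptions; none of this reproduces `UnfDigest.addUNFs(...)`.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.Collections;
import java.util.List;

// Illustrative only: NOT the addUNFs(...) algorithm.
final class CombineSketch {

    static String combine(List<String> parts, boolean sortFirst) {
        List<String> copy = new ArrayList<>(parts); // leave the caller's list untouched
        if (sortFirst) {
            Collections.sort(copy);
        }
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (String part : copy) {
                md.update(part.getBytes(StandardCharsets.UTF_8));
                md.update((byte) '\n'); // hypothetical separator
            }
            return Base64.getEncoder().encodeToString(md.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String> ab = List.of("a", "b");
        List<String> ba = List.of("b", "a");
        System.out.println(combine(ab, true).equals(combine(ba, true)));   // true
        System.out.println(combine(ab, false).equals(combine(ba, false))); // false
    }
}
```

Removing or altering the sort step is therefore not a refactoring detail: the same inputs supplied in a different order would start producing a different result.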

## Testing Strategy

### Existing tests

`UNF6UtilTest` reads fixture files in which:
- first line = expected UNF
- remaining lines = values to hash
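
The fixture layout above can be modeled in a few lines. This is a sketch of the file format only, not the code in `UNF6UtilTest`: the names `FixtureSketch`, `Fixture`, and `parse` are hypothetical, and it assumes lines are used exactly as read, with no trimming.

```java
import java.util.List;

// Illustrative only: models the fixture layout, not the real test class.
final class FixtureSketch {

    // First line is the expected UNF; every later line is one value to hash.
    record Fixture(String expectedUnf, List<String> values) {}

    static Fixture parse(List<String> lines) {
        if (lines.isEmpty()) {
            throw new IllegalArgumentException("a fixture needs at least the expected-UNF line");
        }
        return new Fixture(lines.get(0), List.copyOf(lines.subList(1, lines.size())));
    }

    public static void main(String[] args) {
        // Placeholder content; a real fixture starts with an actual UNF string.
        Fixture f = parse(List.of("UNF:6:<expected>", "1", "2", "3"));
        System.out.println(f.values().size()); // 3
    }
}
```

In a real test the lines would typically come from `Files.readAllLines(path, StandardCharsets.UTF_8)`.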

### When adding features

- Add or extend fixtures in `src/test/resources/test/`.
- Add clear unit coverage for new type branches or canonicalization cases.
- Include corner cases: null/missing, blanks, NaN/Infinity, timezone-bearing dates.
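
When choosing corner-case values for fixtures, it helps to know how the JDK itself parses such tokens before any UNF-specific handling applies. The sketch below uses only `Double.parseDouble`. The class `CornerCaseSketch` and its labels are hypothetical, and it says nothing about how the package maps these values into canonical form.

```java
// Illustrative only: shows plain JDK parsing behavior, not UNF canonicalization.
final class CornerCaseSketch {

    static String classify(String token) {
        if (token == null) {
            return "null"; // Double.parseDouble(null) would throw NullPointerException
        }
        try {
            double d = Double.parseDouble(token);
            if (Double.isNaN(d)) {
                return "nan";
            }
            if (Double.isInfinite(d)) {
                return d > 0 ? "+inf" : "-inf";
            }
            return "finite";
        } catch (NumberFormatException e) {
            return "unparseable"; // blanks and unknown spellings end up here
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("NaN"));       // nan
        System.out.println(classify("-Infinity")); // -inf
        System.out.println(classify(""));          // unparseable
        System.out.println(classify("inf"));       // unparseable
        System.out.println(classify(null));        // null
    }
}
```

Blank and `null` inputs take different paths here, so it is worth giving each its own fixture line or unit test rather than treating them as one case.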

### Regression protection

If you intentionally change canonicalization:
- explain why prior output was incorrect or incomplete,
- update fixtures explicitly,
- include migration/backward-compatibility notes in PR description.

## Coding Conventions for This Package

- Preserve deterministic behavior.
- Prefer explicit conversions to avoid locale/platform drift.
- Keep algorithm constants centralized in `UnfCons`.
- Avoid introducing side effects in static state unless necessary.
- Keep public API overload behavior predictable and symmetric across types.
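
The point about locale and platform drift can be seen with standard JDK calls. The sketch below is a generic Java illustration under a hypothetical class name, `LocaleSketch`; it does not show the package's own formatting routines.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Illustrative only: generic JDK behavior, not code from this package.
final class LocaleSketch {

    // Passing the locale explicitly keeps output stable across machines.
    static String format(Locale locale, double value) {
        return String.format(locale, "%.3f", value);
    }

    // Passing the charset explicitly avoids depending on the platform default.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(format(Locale.ROOT, 1.5));    // 1.500
        System.out.println(format(Locale.GERMANY, 1.5)); // 1,500
        System.out.println(utf8("\u00e9").length);       // 2
    }
}
```

`String.format` without a locale argument uses the JVM default, so the same call can emit `1.500` on one machine and `1,500` on another. Any such difference in canonical text would change the resulting hash.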

## Common Pitfalls

- Forgetting that `UnfDigest` uses static mutable state (`trnps`, `signature`, `fingerprint`).
- Changing default precision (`DEF_NDGTS`, `DEF_CDGTS`) without documenting compatibility impact.
- Updating parsing/format rules for dates without fixture updates.
- Treating formatting cleanups as cosmetic; they may be algorithmic.
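
The static-state pitfall is easy to reproduce in miniature. The sketch below uses a hypothetical class, `StaticStateSketch`, with a single static flag standing in for fields such as `trnps`. Its default value and methods are invented for illustration and do not describe `UnfDigest`.

```java
// Illustrative only: a stand-in for a class with static mutable configuration.
final class StaticStateSketch {

    // Hypothetical flag; the default chosen here is arbitrary.
    static boolean transpose = true;

    static String describe(int rows, int cols) {
        return transpose ? cols + "x" + rows : rows + "x" + cols;
    }

    // Save the current value, apply the override, and always restore it.
    static String describeWith(boolean override, int rows, int cols) {
        boolean previous = transpose;
        transpose = override;
        try {
            return describe(rows, cols);
        } finally {
            transpose = previous;
        }
    }

    public static void main(String[] args) {
        System.out.println(describeWith(false, 2, 3)); // 2x3
        System.out.println(describe(2, 3));            // 3x2, the flag was restored
    }
}
```

Without the `finally` block, the override would leak into every later call in the same JVM, including unrelated tests. Save and restore keeps sequential callers isolated, but it does not make concurrent use safe.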

## Suggested PR Checklist

- [ ] Tests pass locally (`mvn test`).
- [ ] New or changed behavior is covered by tests.
- [ ] Fixture updates are intentional and explained.
- [ ] Backward-compatibility impact is explicitly stated.
- [ ] Public API changes (if any) are documented.

## Useful Starting Points for New Contributors

- Read `UNFUtil` for API shape.
- Read `UnfDigest` for top-level flow and UNF composition.
- Read `UnfNumber` and `RoundRoutines` for numeric canonicalization details.
- Use `UNF6UtilTest` plus fixture files to understand expected outputs quickly.