Script to generate and validate goldens for an import by ajaits · Pull Request #1905 · datacommonsorg/data

ajaits · 2026-03-09T12:07:29Z

Adding support for comparing output files against goldens.

Goldens can have a subset of the output columns. validate_goldens verifies that the output contains every combination of values listed in the golden file.
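
The subset-column matching described above can be sketched roughly as follows. This is an illustration, not the code from this PR: the function names `fingerprint` and `find_missing_goldens`, the whitespace-only normalization, and the sample rows are all assumptions.

```python
import csv
import io


def fingerprint(row, columns):
    """Build a hashable key from the given columns, stripping whitespace."""
    return tuple((col, (row.get(col) or "").strip()) for col in columns)


def find_missing_goldens(golden_rows, output_rows):
    """Return golden rows whose column values match no output row.

    Only the columns present in the golden rows are compared, so the
    golden file may carry a subset of the output columns.
    """
    golden_rows = list(golden_rows)
    if not golden_rows:
        return []
    columns = sorted(golden_rows[0].keys())
    seen = {fingerprint(row, columns) for row in output_rows}
    return [row for row in golden_rows if fingerprint(row, columns) not in seen]


# Hypothetical data: the golden file omits the date and value columns.
output_csv = """observationAbout,variableMeasured,observationDate,value
geoId/06,Count_Person,2020,39538223
geoId/48,Count_Person,2020,29145505
"""
golden_csv = """observationAbout,variableMeasured
geoId/06,Count_Person
geoId/36,Count_Person
"""

output_rows = list(csv.DictReader(io.StringIO(output_csv)))
golden_rows = list(csv.DictReader(io.StringIO(golden_csv)))
missing = find_missing_goldens(golden_rows, output_rows)
print(missing)
```

In this sketch the golden row for geoId/36 is reported as missing because no output row carries that place and statvar combination.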

Expected usage:

- Generate a set of golden output.csv files that include all important statvars and places using data_sampler.py (more in PR#1916) into a folder called golden_data.
- Enable golden validations in the per-import validation config, specifying the golden file to compare against for each output in manifest.json.
- Relax the deletion threshold to allow higher deletions while ensuring that goldens for prominent places and statvars have no deletions.

For more details, please refer to the design doc.

gemini-code-assist · 2026-03-09T12:07:49Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust new framework for validating data imports against predefined "golden" datasets. It provides utilities to both generate these golden sets from existing data, with advanced sampling and inclusion rules, and to perform comparisons to identify discrepancies. Additionally, it includes minor but important fixes to data processing logic, ensuring accurate handling of zero-value evaluations and proper CSV data interpretation during node loading.

Highlights

- New Golden Validation Tool: Introduced validator_goldens.py to compare input data against expected "golden" sets, supporting flexible matching and normalization.
- Golden Generation Capabilities: The new tool can generate golden sets from input data, including options for sampling, exhaustive sampling, and ensuring specific "must-include" values are present.
- Enhanced Data Sampler: The data_sampler.py utility was updated to support "must-include" values and an "exhaustive" sampling mode, improving its utility for golden generation.
- Improved Evaluation Handling: Fixed a bug in property_value_mapper.py to correctly process evaluation results that are numerically zero.
- CSV Loading Refinement: Adjusted file_util.py to correctly load CSV data into dictionaries when using index-based keys, preventing unintended key-value parsing.
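
As a rough illustration of the "must-include" sampling idea in the highlights, the sketch below always keeps rows whose value is on a must-include list and fills the rest of the sample randomly. The function name `sample_rows`, its parameters, and the sample data are hypothetical and do not reflect the actual DataSampler interface.

```python
import random


def sample_rows(rows, column, sample_size, must_include=(), seed=0):
    """Sample rows, always keeping rows whose `column` value is must-include."""
    must_include = set(must_include)
    required = [row for row in rows if row[column] in must_include]
    optional = [row for row in rows if row[column] not in must_include]
    rng = random.Random(seed)  # Fixed seed keeps the sample reproducible.
    extra = max(0, sample_size - len(required))
    sampled = rng.sample(optional, min(extra, len(optional)))
    return required + sampled


# Hypothetical rows keyed by place.
rows = [{"place": f"geoId/{i:02d}", "value": i} for i in range(1, 11)]
sample = sample_rows(rows, "place", sample_size=3, must_include={"geoId/06"})
print([row["place"] for row in sample])
```

Required rows are placed first so they survive even when the requested sample size is smaller than the number of must-include matches.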

Changelog

tools/import_validation/validator_goldens.py
- Added a new script for validating data against golden sets.
- Implemented functions for generating golden sets from input data with various sampling and filtering options.
- Provided functionality to compare input nodes against golden nodes, identifying missing entries.
- Included utility functions for node fingerprinting and value normalization.
tools/import_validation/validator_goldens_test.py
- Added comprehensive unit tests for the validator_goldens.py script.
- Covered tests for configuration loading, value normalization, node fingerprinting, node comparison, and golden generation/validation workflows.
tools/statvar_importer/data_sampler.py
- Added new command-line flags sampler_uniques_per_column, sampler_exhaustive, and sampler_must_include.
- Integrated mcf_file_util for namespace stripping in must-include value processing.
- Enhanced the DataSampler class to support "must-include" values, ensuring specific data points are always sampled.
- Introduced an "exhaustive" sampling mode to capture all unique values.
- Updated configuration loading to handle new flags and exhaustive mode overrides.
tools/statvar_importer/property_value_mapper.py
- Corrected the evaluation logic to properly handle cases where an evaluated data value is 0, ensuring it is not incorrectly treated as None or an empty string.
util/file_util.py
- Modified file_load_csv_dict to prevent misinterpreting the second column as a value column when loading CSVs with key_index=True.
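
The zero-value fix noted for property_value_mapper.py addresses a common Python pitfall: a truthiness check discards 0 along with None and empty strings. The sketch below shows the general pattern only. The helper names are hypothetical and the code is not taken from the PR.

```python
def keep_value_buggy(value):
    """Truthiness check: wrongly rejects 0 and 0.0."""
    return bool(value)


def keep_value_fixed(value):
    """Reject only missing results: None or an empty string."""
    return value is not None and value != ""


results = [0, 0.0, 1.5, None, ""]
buggy = [v for v in results if keep_value_buggy(v)]
fixed = [v for v in results if keep_value_fixed(v)]
print(buggy)  # 0 and 0.0 are dropped
print(fixed)  # 0 and 0.0 are kept
```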

Activity

No human activity has been recorded for this pull request yet.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a new tool for generating and validating 'golden' files for data imports, which is a great addition for ensuring data quality. The changes also include support for 'must-include' values during sampling and a fix for handling zero as a valid evaluation result.

My review has identified a critical bug in validator_goldens.py that would cause a NameError. I've also included suggestions to improve code quality by addressing mutable default arguments, removing leftover debug code, and refactoring duplicated logic. The new tests are comprehensive, but adding a test case for the load_must_include_values function would have caught the aforementioned bug.
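
The mutable default argument issue mentioned in the review is a general Python pitfall, shown below in a minimal form. The function names are hypothetical and the code is not from validator_goldens.py.

```python
def add_value_shared(value, values=[]):
    """Pitfall: the default list is created once and shared across calls."""
    values.append(value)
    return values


def add_value_safe(value, values=None):
    """Use None as the sentinel and create a fresh list per call."""
    if values is None:
        values = []
    values.append(value)
    return values


# Copy the shared results so later calls do not change what we recorded.
first_shared = list(add_value_shared("geoId/06"))
second_shared = list(add_value_shared("geoId/48"))
first_safe = add_value_safe("geoId/06")
second_safe = add_value_safe("geoId/48")
print(second_shared)  # carries over the value from the first call
print(second_safe)    # contains only the value from this call
```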

tools/import_validation/validator_goldens.py

tools/statvar_importer/data_sampler.py

ajaits · 2026-03-30T18:59:46Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a new GOLDENS validator to the import validation framework, enabling data verification against expected record sets. The implementation includes core validation logic in validator_goldens.py, integration into the Validator class and runner, and updated documentation. Additionally, the DataSampler utility was enhanced with exhaustive sampling and "must-include" column support. Feedback identifies a potential breaking change in node fingerprinting, memory efficiency concerns with large DataFrames, brittle path construction, and performance overhead in configuration lookups during sampling.

tools/statvar_importer/mcf_diff.py

tools/import_validation/validator.py

tools/import_validation/validator_goldens.py

tools/statvar_importer/data_sampler.py

scripts/us_fed/treasury_constant_maturity_rates/validation_config.json

tools/import_validation/README.md

tools/import_validation/validator_goldens.py

tools/import_validation/validator_goldens_test.py

tools/import_validation/validator_integration_test.py

tools/import_validation/README.md

tools/import_validation/validator_goldens.py

tools/import_validation/README.md

ajaits · 2026-04-10T11:50:45Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a GOLDENS_CHECK validator to the import validation toolset, enabling data verification against predefined "golden" records. The changes include the core validator implementation in validator_goldens.py, integration into the validation runner, comprehensive documentation, and new unit tests. Additionally, the PR updates the deleted records threshold to use percentages and provides utility improvements for MCF processing and file handling. Review feedback identifies invalid JSON syntax in the documentation examples and a leftover debug print statement that should be removed.
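
A percentage-based deletion threshold, as mentioned in the review summary above, might be checked along these lines. The function names, parameters, and the choice to treat an empty previous dataset as 0% deleted are assumptions for illustration, not the validator's actual logic or config keys.

```python
def deleted_percent(deleted_count, previous_count):
    """Return deletions as a percentage of the previous record count."""
    if previous_count <= 0:
        return 0.0  # Assumption: nothing to delete from an empty dataset.
    return 100.0 * deleted_count / previous_count


def within_deletion_threshold(deleted_count, previous_count, threshold_percent):
    """True when the share of deleted records does not exceed the threshold."""
    return deleted_percent(deleted_count, previous_count) <= threshold_percent


ok = within_deletion_threshold(50, 1000, threshold_percent=10.0)
too_many = within_deletion_threshold(200, 1000, threshold_percent=10.0)
print(ok, too_many)
```

A relative threshold like this scales with import size, which is why it pairs naturally with goldens: the threshold can be relaxed while the golden check still guards the prominent places and statvars.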

tools/import_validation/Validations.md

tools/import_validation/validator_goldens.py

ajaits added 2 commits December 30, 2025 19:53

Support 0 eval results

c641f9f

Goldens validator for import

58b4253

gemini-code-assist bot reviewed Mar 9, 2026

ajaits added 5 commits March 10, 2026 00:50

fix golden csv output

4128621

fix goldens generator

857cbde

move data_sampler to seperate pr

2960346

add GOLDENS validator

0252583

self review

92f4dcc

ajaits requested a review from vish-cs March 20, 2026 12:57

lint

9ac07d5

gemini-code-assist bot reviewed Mar 30, 2026

ajaits added 4 commits March 31, 2026 00:35

update master

c8eb263

sync

ca499d2

gemini review

543e8f5

gemini review

0579ee6

vish-cs reviewed Apr 6, 2026

ajaits added 7 commits April 6, 2026 18:03

Merge branch 'master' of https://github.com/datacommonsorg/data into …

c515da5

…dc-goldens

fix review comments

fb01985

Merge branch 'master' of https://github.com/datacommonsorg/data into …

8cf761c

…dc-goldens

Merge branch 'master' of https://github.com/datacommonsorg/data into …

200ba39

…dc-goldens

review comments

0ee2e4e

lint

a18c676

fix tests

a0ec13f

vish-cs reviewed Apr 10, 2026

tools/import_validation/README.md (outdated, resolved)

vish-cs approved these changes Apr 10, 2026

update readme

c130826

gemini-code-assist bot reviewed Apr 10, 2026

tools/import_validation/Validations.md (outdated, resolved)

tools/import_validation/Validations.md (outdated, resolved)

tools/import_validation/validator_goldens.py (outdated, resolved)

ajaits added 2 commits April 10, 2026 17:25

gemini review

5257056

update readme

f17b874

ajaits enabled auto-merge (squash) April 10, 2026 12:30

ajaits merged commit c278165 into datacommonsorg:master Apr 10, 2026
9 checks passed
