Skip to content

Save hash and path grid informations when generating FK Tables#247

Open
kamillaurent wants to merge 5 commits into
mainfrom
hash_theory
Open

Save hash and path grid informations when generating FK Tables#247
kamillaurent wants to merge 5 commits into
mainfrom
hash_theory

Conversation

@kamillaurent
Copy link
Copy Markdown

@kamillaurent kamillaurent commented Apr 28, 2026

When a FK Table is generated, the hash information and path of the pineappl grid used for it are saved in the metadata of the FK Table.

These information can be read using the pinealppl-cli, for example:

pineappl read --show ATLAS_WZ_TOT_13TEV-ATLASWZTOT13TEV81PB_Z_tot.pineappl.lz4

outputs, between the other informations, this lines:

grid_files: {"ATLAS_WZ_TOT_13TEV-ATLASWZTOT13TEV81PB_Z_tot.pineappl": {"hash": "f1110fd0bb8ccddf28bf722f9f9c3dd5", "path": "/data/theorie/klaurent/FKTables/pineko/data/grids/41000000/ATLAS_WZ_TOT_13TEV-ATLASWZTOT13TEV81PB_Z_tot.pineappl.lz4"}}

This PR is related to issue #225

Copy link
Copy Markdown
Member

@scarlehoff scarlehoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
(haven't tested it, but I trust the OP is a real-life test)

Comment thread src/pineko/evolve.py Outdated
if grid_path is not None:
grid_path_obj = pathlib.Path(grid_path)
grid_hash = hashlib.md5(grid_path_obj.read_bytes()).hexdigest()
grid_files = {grid_path_obj.stem: {"hash": grid_hash, "path": str(grid_path_obj.resolve())}}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
grid_files = {grid_path_obj.stem: {"hash": grid_hash, "path": str(grid_path_obj.resolve())}}
grid_files = {grid_path_obj.stem: {"hash": grid_hash, "theory_folder": grid_path_obj.parent.name}}

Out of privacy concerns, I would just store the path starting from the theory_slim folder.

My suggestion is, since you have the grid name as the key, save just the theory folder name.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed on the privacy, but I think I would keep the resolve(), because I think this way we can resolve symbolic links, right? i.e. .resolve().parent (or similar)

@scarlehoff
Copy link
Copy Markdown
Member

Hi @kamillaurent please do the small change requested and fix the issues found by pre-commit (usually just means you didn't run pre-commit, if you do the changes then git add the evolve.py file and run pre-commit it will be fixed.

Copy link
Copy Markdown
Contributor

@felixhekhorn felixhekhorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember the other things we discussed:

  • split in two simple keys "grid_path" and "grid_hash"
  • address FONLL magic

Comment thread src/pineko/evolve.py Outdated
if grid_path is not None:
grid_path_obj = pathlib.Path(grid_path)
grid_hash = hashlib.md5(grid_path_obj.read_bytes()).hexdigest()
grid_files = {grid_path_obj.stem: {"hash": grid_hash, "path": str(grid_path_obj.resolve())}}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed on the privacy, but I think I would keep the resolve(), because I think this way we can resolve symbolic links, right? i.e. .resolve().parent (or similar)

@scarlehoff
Copy link
Copy Markdown
Member

@kamillaurent please address the changes and run pre-commit so that we can merge this

@scarlehoff
Copy link
Copy Markdown
Member

There's a few conflicts due to having merge @Radonirinaunimi's PR but they are in the documentation of the functions so it should be quick to fix. It would be good to have and use this PR for the next batch of fk tables so we have this information in 😅

Kamil Laurent added 2 commits May 26, 2026 09:57
@kamillaurent
Copy link
Copy Markdown
Author

Now I made the informations in the metadata more readable. For example, using the command
pineappl read --show ATLAS_WZ_TOT_13TEV-ATLASWZTOT13TEV81PB_Z_tot.pineappl.lz4 one can find the informations:

...
grid_hash: f1110fd0bb8ccddf28bf722f9f9c3dd5
grid_path: /data/grids/41000000/ATLAS_WZ_TOT_13TEV-ATLASWZTOT13TEV81PB_Z_tot.pineappl.lz4
grid_theory: 41000000
...

When we use the FONLL procedure, in the last step many FK-tables get merged into one. I am working on retaining the info about hash and theory of the original FKs into the merged one.

Comment thread src/pineko/evolve.py Outdated
Comment on lines +117 to +131
if grid_path is not None:
grid_path_obj = pathlib.Path(grid_path).resolve()
grid_hash = hashlib.md5(grid_path_obj.read_bytes()).hexdigest()
grid_path_parts = grid_path_obj.parts
if "pineko" in grid_path_parts:
pineko_idx = grid_path_parts.index("pineko")
display_path = str(pathlib.Path("/", *grid_path_parts[pineko_idx + 1 :]))
elif "data" in grid_path_parts:
data_idx = grid_path_parts.index("data")
display_path = str(pathlib.Path("/", *grid_path_parts[data_idx:]))
else:
display_path = grid_path_obj.name
fktable.set_metadata("grid_hash", grid_hash)
fktable.set_metadata("grid_theory", grid_path_obj.parent.name)
fktable.set_metadata("grid_path", display_path)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This is far too much code duplication - this needs to be encapsulated
  2. do we really need grid_path? this looks fairly complicated ...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the grids path part is a bit more complicated than needed, data/grids are not important.

Once you have the grid path, the "path of the grid itself" is grid_path.name and the theory folder grid_path.parent.name. You don't need anything else.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to simplify the metadata, using a function and calling it when the FK table is constructed (so there is no code duplication):

def set_fktable_metadata(fktable, theory_meta, grid_path=None):
    """Store the common FK metadata and, if available, grid provenance."""
    fktable.set_metadata("pineko_version", version.__version__)
    fktable.set_metadata("theory_card", json.dumps(theory_meta))
    if grid_path is None:
        return

    grid_path_obj = pathlib.Path(grid_path).resolve()
    grid_hash = hashlib.md5(grid_path_obj.read_bytes()).hexdigest()
    fktable.set_metadata("grid_hash", grid_hash)
    fktable.set_metadata("grid_theory", grid_path_obj.parent.name)
    fktable.set_metadata("grid_name", grid_path_obj.name)

For consistency I moved into the function also pineko_version and theory_card. I removed the path since the relevant information is the name of the folder, which is also the theory ID.

@kamillaurent
Copy link
Copy Markdown
Author

@scarlehoff regarding the FONLL procedure, fonll.py merges all the FK-Tables into the $THEORY_ID_00 and retains informations only from this one. For example, pineappl read --show 41000000/HERA_NC_225GEV_EP_SIGMARED.pineappl.lz4 gives

...
eko_version: 0.15.3
fk_assumptions: Nf4Ind
fktable_id: HERACOMBNCEP460
grid_hash: 0a34218e5e6b254475243264d1ced40e
grid_name: HERA_NC_225GEV_EP_SIGMARED.pineappl.lz4
grid_theory: 4100000000
hepdata: 10.17182/hepdata.68951.v1/t4
nnpdf_id: HERACOMBNCEP460
...

Is this sufficient or we want informations about all the merged grids? In the case we want all the merged grids, the grid_name will be the same for all. The grid_hash and grid_theory will be different, should we store them into an array and store the array in the metadata of the merged FK Table?

@scarlehoff
Copy link
Copy Markdown
Member

I would say that grid theory should strip the extra 0s (so it should say just 41000000, since that covers all).
Grid name is fine.

And my preference for the hash would be simply a string of

00-0a34218e5e6b254475243264d1ced40e 01-hash 02-hash 03-...

@felixhekhorn
Copy link
Copy Markdown
Contributor

Are you sure multiple plain keys wouldn't be better? Like grid_name_00?

In that case we should also mark the standard field grid_name: /FONLL or something that can never be a hash.

I'm thinking about what would be easiest to parse afterwards...

@scarlehoff
Copy link
Copy Markdown
Member

scarlehoff commented May 28, 2026

In my view, the important part about the hash is that it is unique in the comparison so making sure that all fktables have the key grid_hash and that the string is unique is sufficient and preferable.
The fact that one of the hashes is made by multiple hashes is irrelevant.

And, for the cases in which the hashes are not identical, then one still has the information about the partial grids but most of the time we want a yes/no question to the "are these coming from the same grids" question. It is enough for one to be different.

@kamillaurent
Copy link
Copy Markdown
Author

Done! Now for a merged FK-Table the metadata contains the main theory and the hash of all the original (merged) FKs. For example:
pineappl read --show 41000000/HERA_NC_225GEV_EP_SIGMARED.pineappl.lz4 gives

...
fktable_id: HERACOMBNCEP460
grid_hash: 00-0a34218e5e6b254475243264d1ced40e 01-efd1748bb08c25c43a22971cfe8699a9 02-6d0e6e92977898c4f052c05b4880eb51 03-bf95481ca46d9859e57476a0c3ded8db 04-fe3403683c505c439ef9e0dbfbdb74a7 05-df3dc24b51036ff6f2a97d93c38f445d 06-a70df1c6de56f8f5acb0229f18faa250
grid_name: HERA_NC_225GEV_EP_SIGMARED.pineappl.lz4
grid_theory: 41000000
...

@scarlehoff
Copy link
Copy Markdown
Member

@kamillaurent make sure to run pre-commit on the changed files so that the CI doesn't complain!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants