Metadata parsing overhaul #26

@bryngemark

The ldmx-sw parameter dump parsing used to extract rucio metadata was originally written to handle a rather contained set of metadata variables. It is, however, very rigid and cannot deal gracefully with parameter name changes. When updating to a new major or minor release of ldmx-sw, there is almost always an initial job failure caused by parameter name changes interrupting the metadata extraction.

We are not using a JSON tree in rucio, but instead a flat list of key = value pairs. Instead of the LDCS developer guessing which parameters we want to keep, and freezing the key names in the rucio record while updating the lookup name to work with different versions, we should simply recurse through the parameter dump JSON tree and construct the keys as we go.
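A minimal sketch of what such a recursion could look like. The function name `flatten_parameters`, the `.` separator, and the parameter names in the usage example are assumptions for illustration only; they are not taken from ldmx-sw or the existing LDCS code.

```python
import json


def flatten_parameters(node, prefix="", sep="."):
    """Recursively flatten a nested JSON-like tree into a flat dict.

    Keys are built from the path through the tree, so a renamed
    parameter simply shows up under its new name instead of breaking
    a hard-coded lookup. The separator is an assumption.
    """
    flat = {}
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}{sep}{key}" if prefix else str(key)
            flat.update(flatten_parameters(value, child, sep))
    elif isinstance(node, list):
        # List elements are addressed by their index in the key.
        for index, value in enumerate(node):
            child = f"{prefix}{sep}{index}" if prefix else str(index)
            flat.update(flatten_parameters(value, child, sep))
    else:
        # Leaf value (string, number, bool or None): store it as is.
        flat[prefix] = node
    return flat


if __name__ == "__main__":
    # Hypothetical parameter dump; real dumps will look different.
    dump = json.loads(
        '{"run": 42, "sequence": [{"className": "ExampleProducer", "threshold": 1.5}]}'
    )
    for key, value in flatten_parameters(dump).items():
        print(f"{key} = {value}")
```

Note that empty dictionaries and empty lists produce no keys in this sketch; whether they should be recorded is a design choice left open here.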

Introducing this change also allows for a change in how metadata is kept. Storing all parameters instead of a chosen subset means that each job easily produces several thousand lines of metadata. However, for typical sample production, the overwhelming majority of those parameters are common to all files. The run number, wall time, job site, etc. are naturally different from job to job. For this reason, I have considered implementing metadata at the batch level instead, but I haven't figured out how to register and retrieve it with rucio, which fundamentally operates on files.

The proposed solution is to use the first finishing job as a dataset metadata record.

  • For each job, when it is time to extract metadata, we first check if there is already a file registered for its dataset.
    • If not, we write all metadata to a record and are done. (This will be the reference record.)
    • If yes, we get its record.
      • If it contains a pointer to a metadata reference record file, we get that file's metadata.
      • If not, we have already found the reference metadata record.
    • Then we compare the current job's metadata to this reference record:
      • We only keep the lines that are not already exactly present in the reference.
      • We add a line pointing to the metadata reference record file such that we can always look up the metadata with rucio.
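The steps above can be sketched as follows. This is only an illustration under stated assumptions: a plain dictionary `records` stands in for rucio (no rucio API calls are made), and the key name `metadata_reference_file` and the function names are hypothetical.

```python
# Hypothetical key under which a record points to its reference file.
REFERENCE_KEY = "metadata_reference_file"


def build_record(job_meta, dataset_files, records):
    """Return the metadata record to store for the current job.

    job_meta:      flat dict of this job's metadata
    dataset_files: names of files already registered for the dataset
    records:       dict mapping file name -> stored record
                   (an in-memory stand-in for rucio)
    """
    if not dataset_files:
        # No file registered yet: this job becomes the reference record.
        return dict(job_meta)

    # Take any registered file and follow its pointer, if it has one.
    first = dataset_files[0]
    ref_file = records[first].get(REFERENCE_KEY, first)
    reference = records[ref_file]

    # Keep only the entries not exactly present in the reference.
    reduced = {
        key: value
        for key, value in job_meta.items()
        if key not in reference or reference[key] != value
    }
    reduced[REFERENCE_KEY] = ref_file
    return reduced


def resolve_record(file_name, records):
    """Rebuild the full metadata of a file from its reduced record."""
    record = records[file_name]
    ref_file = record.get(REFERENCE_KEY)
    if ref_file is None:
        return dict(record)  # this file is the reference itself
    full = dict(records[ref_file])
    full.update({k: v for k, v in record.items() if k != REFERENCE_KEY})
    return full
```

One limitation of this sketch: a parameter that is present in the reference record but absent from a later job would reappear when the full metadata is rebuilt. If that case can occur, it would need explicit handling.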
