Specify or at least document the expectations for merging sets #465

@gouttegd

The spec should say something about what to expect when two (or more) sets are merged together.

It does not necessarily have to mandate a given behaviour, especially since I don’t think there would be a single merging behaviour that would fit all use cases. But it should at least describe what the possible behaviours are, and possibly recommend one as the “most sensible” and suitable as a default behaviour.

Some of the questions that arise when merging sets together:

About the individual mappings

  • Should duplicated mappings be dropped? If so, how to detect duplicated mappings?

  • What to do if two mappings have the same record_id? (Note that this is a different question from the previous one: two mappings may have the same record_id and yet be different – within a single set this is not supposed to happen, but this could very well happen when merging several sets.)

  • What to do if mappings from one set have a record_id and mappings from the other set don’t? Drop the record_id from all mappings? Automatically generate a record_id for the mappings that don’t have one? Do nothing (which would result in an invalid set, since mixing mappings with a record_id and mappings without one is not allowed)?
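
To make these questions more concrete, here is a minimal Python sketch of one possible policy. Everything in it is an assumption made for illustration: mappings are represented as plain dicts, a "duplicate" is defined as a mapping that agrees with an already-kept mapping on every slot other than record_id, and the function names are hypothetical (they are not part of sssom-py or SSSOM-Java). Conflicting record_id values are only reported, not resolved.

```python
from typing import Any

Mapping = dict[str, Any]


def _freeze(value: Any) -> Any:
    """Turn lists into tuples so that slot values are hashable."""
    return tuple(value) if isinstance(value, list) else value


def content_key(mapping: Mapping) -> frozenset:
    """Identity of a mapping, ignoring record_id (an assumed definition of 'duplicate')."""
    return frozenset((k, _freeze(v)) for k, v in mapping.items() if k != "record_id")


def merge_mappings(
    first: list[Mapping], second: list[Mapping]
) -> tuple[list[Mapping], list[str]]:
    """Concatenate two lists of mappings, dropping duplicates.

    Returns the merged list and the record_id values shared by mappings
    whose content differs.
    """
    merged: list[Mapping] = []
    seen_keys: set[frozenset] = set()
    id_to_key: dict[str, frozenset] = {}
    conflicts: list[str] = []
    for mapping in first + second:
        key = content_key(mapping)
        if key in seen_keys:
            continue  # same content already kept: drop the duplicate
        seen_keys.add(key)
        record_id = mapping.get("record_id")
        if record_id is not None:
            # Reaching this point means the content is new, so a record_id
            # that was already seen now identifies two different mappings.
            if record_id in id_to_key and record_id not in conflicts:
                conflicts.append(record_id)
            id_to_key.setdefault(record_id, key)
        merged.append(mapping)
    return merged, conflicts


def has_mixed_record_ids(mappings: list[Mapping]) -> bool:
    """True if some mappings have a record_id and others do not."""
    with_id = sum(1 for m in mappings if m.get("record_id") is not None)
    return 0 < with_id < len(mappings)
```

A caller could use has_mixed_record_ids on the merged list to decide whether to strip all record_id values or to generate the missing ones; which of the two is preferable is exactly the open question above.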

About the set metadata

  • Which metadata to use for the final, merged set?

Possible options:

  • Use only the metadata from one of the merged sets (which one?).

  • For any given slot, use the value from the first set if present, otherwise the value from the second set.

  • For any given slot, only set the value in the resulting set if both input sets have the same value, otherwise do not set the value at all.

  • For any multi-valued slot, if both input sets have a different list of values, merge the values together in the resulting set.
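
For comparison, the last three options could be sketched as follows in Python. This is only an illustration under stated assumptions: set metadata is a plain dict, a slot without a value is simply absent from the dict, multi-valued slots are Python lists, and all function names are hypothetical rather than taken from any existing SSSOM library.

```python
from typing import Any

Metadata = dict[str, Any]


def first_wins(first: Metadata, second: Metadata) -> Metadata:
    """For each slot, use the first set's value if present, else the second's."""
    return {**second, **first}


def keep_if_equal(first: Metadata, second: Metadata) -> Metadata:
    """Only keep the slots for which both sets have the same value."""
    return {k: v for k, v in first.items() if k in second and second[k] == v}


def union_multi_valued(first: Metadata, second: Metadata) -> Metadata:
    """Merge list-valued slots present in both sets; other slots follow first_wins.

    Merged lists keep the first set's order and append values found only
    in the second set.
    """
    merged = first_wins(first, second)
    for slot in first.keys() & second.keys():
        a, b = first[slot], second[slot]
        if isinstance(a, list) and isinstance(b, list):
            merged[slot] = a + [v for v in b if v not in a]
    return merged
```

Note that the three functions disagree on mapping_set_id whenever the input sets have different identifiers, which suggests that some slots may need a dedicated rule regardless of the general strategy chosen.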

Existing behaviours

FWIW, the current behaviour of SSSOM-Java’s sssom-cli tool is:

  • merging all mappings without any attempt at dropping duplicates or fixing record_id (though users have the possibility of doing that themselves using some SSSOM/T rules);
  • for single-valued set metadata slots, using the value from the first set only, whatever that value is (including no value);
  • for multi-valued set metadata slots, merging the values from all sets, unless the --no-metadata-merge option is used (in which case the behaviour is the same as for single-valued slots: only the values from the first set are used).

I have not looked yet at what sssom-py does.
