Description
The spec should say something about what to expect when two (or more) sets are merged together.
It does not necessarily have to mandate a given behaviour, especially since I don’t think there would be a single merging behaviour that would fit all use cases. But it should at least describe what the possible behaviours are, and possibly recommend one as the “most sensible” and suitable as a default behaviour.
Some of the questions that arise when merging sets together:
About the individual mappings
- Should duplicated mappings be dropped? If so, how to detect duplicated mappings?
- What to do if two mappings have the same record_id? (Note that this is a different question from the previous one: two mappings may have the same record_id and yet be different. Within a single set this is not supposed to happen, but it could very well happen when merging several sets.)
- What to do if mappings from one set have record_id and mappings from the other set don’t? Drop all record_id from all mappings? Automatically generate record_id for the mappings that don’t have one? Do nothing (which would result in an invalid set, since mixing mappings with a record_id and mappings without one is not allowed)?
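To make these questions concrete, here is a minimal Python sketch of one possible set of answers. It is not taken from any existing SSSOM implementation, and everything in it is an assumption made for illustration: mappings are plain dicts, two mappings count as duplicates when all their slots except record_id are equal, and the helper names (content_key, record_id_clashes, merge_mappings) and the "generated:" prefix are hypothetical.

```python
import hashlib
import json


def content_key(mapping):
    """Canonical string of a mapping's slots, ignoring record_id (assumed duplicate criterion)."""
    content = {k: v for k, v in mapping.items() if k != "record_id"}
    return json.dumps(content, sort_keys=True)


def record_id_clashes(mappings):
    """Return the record_id values that are shared by mappings with different content."""
    by_id = {}
    for m in mappings:
        if "record_id" in m:
            by_id.setdefault(m["record_id"], set()).add(content_key(m))
    return sorted(rid for rid, keys in by_id.items() if len(keys) > 1)


def merge_mappings(first, second, record_id_policy="generate"):
    """Concatenate two lists of mappings, dropping duplicates.

    record_id_policy only matters when the result mixes mappings with and
    without a record_id: "drop" removes every record_id, "generate" adds a
    hash-based one where missing, "keep" does nothing (invalid set).
    """
    merged, seen = [], set()
    for mapping in list(first) + list(second):
        key = content_key(mapping)
        if key in seen:
            continue  # same content already present: treat as a duplicate
        seen.add(key)
        merged.append(dict(mapping))

    with_id = sum("record_id" in m for m in merged)
    mixed = 0 < with_id < len(merged)
    if mixed and record_id_policy == "drop":
        for m in merged:
            m.pop("record_id", None)
    elif mixed and record_id_policy == "generate":
        for m in merged:
            if "record_id" not in m:
                digest = hashlib.sha256(content_key(m).encode("utf-8")).hexdigest()
                m["record_id"] = "generated:" + digest[:16]
    return merged
```

Running record_id_clashes on the merged list would flag the second question above (same record_id, different mapping) without deciding how to resolve it.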
About the set metadata
- Which metadata to use for the final, merged set?
Possible options:
- Use only the metadata from one of the merged sets (which one?).
- For any given slot, use the value from the first set if present, otherwise the value from the second set.
- For any given slot, only set the value in the resulting set if both input sets have the same value, otherwise do not set the value at all.
- For any multi-valued slot, if both input sets have a different list of values, merge the values together in the resulting set.
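The options above can be sketched as small Python functions to show how they differ on the same input. This is only an illustration under stated assumptions: set metadata is a plain dict, None means "slot not set", the strategy names ("first_only", "first_then_second", "only_if_equal") are hypothetical labels for the first three options, and the caller says which slots are multi-valued.

```python
def merge_slot(first, second, strategy):
    """Merge one single-valued slot; None means the slot is not set."""
    if strategy == "first_only":
        return first
    if strategy == "first_then_second":
        return first if first is not None else second
    if strategy == "only_if_equal":
        return first if first == second else None
    raise ValueError("unknown strategy: " + strategy)


def merge_multi_valued(first, second):
    """Union of two value lists, keeping first-seen order."""
    merged = list(first or [])
    for value in second or []:
        if value not in merged:
            merged.append(value)
    return merged


def merge_metadata(first, second, strategy, multi_valued_slots=()):
    """Merge two metadata dicts slot by slot."""
    result = {}
    slots = list(first) + [s for s in second if s not in first]
    for slot in slots:
        if slot in multi_valued_slots:
            value = merge_multi_valued(first.get(slot), second.get(slot))
        else:
            value = merge_slot(first.get(slot), second.get(slot), strategy)
        if value not in (None, []):  # leave unset slots out of the result
            result[slot] = value
    return result
```

For example, with a first set carrying license "CC0" and a second carrying license "CC-BY", "first_only" and "first_then_second" both keep "CC0", while "only_if_equal" leaves the slot unset.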
Existing behaviours
FWIW, the current behaviour of SSSOM-Java’s sssom-cli tool is:
- merging all mappings without any attempt at dropping duplicates or fixing record_id (though users have the possibility of doing that themselves using some SSSOM/T rules);
- for single-valued set metadata slots, using the value from the first set only, whatever that value is (including no value);
- for multi-valued set metadata slots, merging the values from all sets, unless the --no-metadata-merge option is used (in which case the behaviour is the same as for single-valued slots: only the values from the first input set are used).
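As a rough model of the behaviour just described (not SSSOM-Java code, and not a statement about sssom-py), the three rules can be written in a few lines of Python. The set layout (a dict with "metadata" and "mappings" keys), the function name, and the metadata_merge parameter standing in for the --no-metadata-merge option are all hypothetical. Whether repeated values are collapsed when multi-valued slots are merged is not stated above; the sketch assumes they are.

```python
def merge_sets_as_described(sets, multi_valued_slots, metadata_merge=True):
    """Merge mapping sets following the three rules listed above."""
    first = sets[0]
    merged = {
        # Single-valued slots: whatever the first set has, including nothing.
        "metadata": dict(first["metadata"]),
        "mappings": [],
    }
    for mapping_set in sets:
        # All mappings are kept: no duplicate removal, no record_id fixing.
        merged["mappings"].extend(mapping_set["mappings"])

    if metadata_merge:
        for slot in multi_valued_slots:
            values = []
            for mapping_set in sets:
                for value in mapping_set["metadata"].get(slot, []):
                    if value not in values:  # assumption: repeated values collapse
                        values.append(value)
            if values:
                merged["metadata"][slot] = values
    return merged
```

Calling it with metadata_merge=False leaves the multi-valued slots exactly as they are in the first set.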
I have not looked yet at what sssom-py does.