Proposal: An Ordered list w/ unique keys can be used in place of a unordered dictionary#203
Proposal: An Ordered list w/ unique keys can be used in place of a unordered dictionary#203DavidSagan wants to merge 4 commits intomainfrom
Conversation
…a unordered dictionary.
| with the restriction that no duplicate keys can be present in the list. | ||
| For example, the above dictionary written as a list would look like: | ||
| ```{code} yaml | ||
| this_dictionary_expressed_as_list: |
There was a problem hiding this comment.
Are you referrring to named_dictionary above?
Aka saying
- named_dictionary:
key3: value3
key4: value4is the same as
- named_dictionary:
- key3: value3
- key4: value4Sorry, but I fear this is not a good idea, it adds ambiguity in parsing and complicates everything for no obvious gain.
There was a problem hiding this comment.
Personally I don't care. You and @EZoni can come to some agreement on what is best.
There was a problem hiding this comment.
Key differences:
| Example 1 | Example 2 | |
|---|---|---|
| Structure of named_dictionary | A single mapping/dict | A list of mappings/dicts |
| Access value3 (Python) | data[0]["named_dictionary"]["key3"] | data[0]["named_dictionary"][0]["key3"] |
| Can have duplicate keys? | No (keys must be unique in a dict) | Yes (each dict is separate) |
| Order guaranteed? | Depends on implementation | Yes (it's a list) |
The choice between them depends on your use case. If the keys are unique and represent a fixed structure, a single dictionary (Example 1) is simpler. If you need an ordered collection or might have duplicate keys, a list of dictionaries (Example 2) is more appropriate.
There was a problem hiding this comment.
Personally I don't care. You and @EZoni can come to some agreement on what is best.
Sounds good.
I would simply not relax interpretation and keep it clear. I.e., not adding the case in this PR (not merging it).
There was a problem hiding this comment.
Major problem
This is not a great idea as it complicates parsing and introduces ambiguities.
We will regret it later in the parsers and have to dynamically branch in the schema application all the time.
Minor Problem
It is also not standard how YAML is interpreted (i.e. YAML 101 is the section this adds to.)
But I oppose this for the more fundamental reason above, not for where it is written.
There was a problem hiding this comment.
Review summary for status on the PR: I oppose this, this is not a good idea.
Details described here: #203 (comment)
@ax3l You need to articulate your objection. In fact you advocated the parser being able to preserve dictionary order if possible and this just does the same thing. And this works even in the case where the parser is not able to preserve order. |
|
This is just the summary/status, let me link the comment for clarity. |
I don't see this as an objection. What exactly are you worried about? |
|
Both things are not equivalent in YAML; if we choose to relax them to be equivalent for PALS this will cause a problems writing and parsing schemas. Lastly, in my view there is not a strong motivation why this complication is needed at all, lead by an example where this occurs at all please:
C++ has multiple ways to express insertions preserving dicts and also the YAML library you picked actually supports it. There is a bug in your implementation. |
|
Here you go: Detailed bug report: pals-project/pals-cpp#18 |
I don't see how this causes problems for writing and parsing. If you think there is a problem please come up with an example. In terms of why this is a desirable feature, I quote what your wrote: |
FundamentalNo, please fix pals-project/pals-cpp#18 As the reference implementation in C++, Banter
My whole motivation behind the |
OK will fix but this is independent of this PR. |
|
Then PR has no motivating real-world use case left, and we can close it. I propose to drop |
I would agree with you if this represented a significant complication. But it does not. To have a parser handle this is literally a few lines of code. And once implemented there is no further maintenance needed. |
|
That is not correct, this text would add a conditional branch check for every dict to be a list, too. That is extremely random and confusing and will cause a lot of headache if enacted. Every dict has to be checks to be a list. Files will have all kinds of formatting. Typing will be a mess. This will cause confusion. You cannot write clear static scheme. Or just run it through an LLM of your choice and ask it if it is a great idea for a portable standard that shall be easy to implement and verify. Again, there is no real world need for this. Why do we even discuss this? |
When programmed correctly, the conditional branch is handled by the same low level routine that is written once and is transparent to the higher level code. Again, this can be done with a few lines of code. And I can see a real world case for this for some extensions where order matters in a construct that PALS specifies as unordered. |
No. And you ignore what I wrote on schemas and the other points I made why this is a bad idea.
No sorry, this is a super hypothetical need. I do strongly recommend against this. My reasons are above. |
You have not justified your assertions. I have told you how a parser can handle this very simply. If you want more detail I can supply it. And there are other ways of simply handling this by converting from list to ordered dict on input. If you think that this complicates things outside of any parsing please be more specific. And example would help. |
|
Let's start with the motivation for this PR: why would the PALS standard specify that "something must be unordered"? Please justify and motivate the actual need first.
Where is this needed? There is no example here that one can follow. |
|
Here is a Claude summary of your change:
Absolutely, I will give you now 4 problems it causes + examples for your fully motivation-example-free proposal. |
1. It introduces representational ambiguity.The same logical data now has two valid serializations. Every tool, validator, and parser that consumes these lattice files must handle both representations identically. This doubles the surface area for bugs and increases the cognitive load on anyone reading or writing these files. A contributor looking at two lattice files might see what appears to be structurally different data that is actually semantically identical. ExampleTwo files describe the same beamline (ignore details), but look structurally different: File A (dictionary form): facility:
drift1:
kind: Drift
length: 0.25
quad1:
kind: Quadrupole
MagneticMultipoleP:
Bn1: 1.0
length: 1.0
File B (list form): facility:
- drift1:
kind: Drift
length: 0.25
- quad1:
kind: Quadrupole
MagneticMultipoleP:
Bn1: 1.0
length: 1.0Are these the same? A human reading them might not be sure. A diff tool will flag them as completely different. A code review becomes harder because a contributor could switch forms arbitrarily, and git diff will show a large structural change that is semantically a no-op. |
2. It conflates two distinct data models.A list of key-value pairs and a dictionary are fundamentally different structures. By declaring them interchangeable (under constraints), you're essentially asking every consumer to implement a normalization step — "if you see a list of single-key mappings with no duplicate keys, treat it as a dict." This is implicit schema logic that lives outside the schema itself. Every consumer of the schema must now implement normalization logic. Consider a tool that looks up an element by name: # With a dict, this is trivial:
quad_params = lattice["facility"]["quad1"]
# With the list-of-dicts form, you need:
quad_params = None
for entry in lattice["elements"]:
if "quad1" in entry:
quad_params = entry["quad1"]
breakNow imagine a validation library, a conversion tool to MAD-X format, a visualization tool, and a simulation runner. Each of these independently must include branching logic: def get_elements(lattice):
elems = lattice["facility"]
if isinstance(elems, list):
# convert list-of-single-key-dicts to ordered dict
result = {}
for item in elems:
result.update(item)
return result
elif isinstance(elems, dict):
return elems
else:
raise SchemaError("Invalid elements format")Every tool in the ecosystem reimplements this, slightly differently, and each is a potential source of bugs. |
3. The duplicate-key restriction is fragile.The constraint "no duplicate keys in the list" must be enforced at validation time, not by the data format itself. YAML won't stop you from writing duplicate keys in a list. So now you need a custom validator for something that dictionaries give you for free (or at least by convention: YAML's handling of duplicate keys in mappings is itself underspecified, but most parsers will warn or take the last value). Nothing in YAML prevents this: facility:
- drift1:
kind: Drift
length: 0.25
- quad1:
kind: Quadrupole
MagneticMultipoleP:
Bn1: 1.0
length: 1.0
- drift1:
kind: Drift
length: 0.5This is perfectly valid YAML. The list happily contains two entries keyed quad1. A dictionary would have silently overwritten the first or raised a warning, but the list form gives no such signal. (This is an example and happens at every level of the PALS hierarchy, even if we allow duplicate names in this specific list of the example.) The schema now requires a custom validator to catch this: (your "this is just a few lines, don't worry, Axel, I got this") def validate_no_duplicate_keys(elements_list):
keys_seen = set()
for item in elements_list:
for key in item:
if key in keys_seen:
raise ValidationError(f"Duplicate key '{key}' in elements list")
keys_seen.add(key)If any one tool in the ecosystem forgets this check, it will silently process the file with unpredictable behavior: maybe using the first quad1, maybe the last, maybe both. |
4. It solves a problem that's largely been solved.Python 3.7+ guarantees dict insertion order. Most modern YAML libraries preserve mapping order (you can pick a proper one for a PALS reference implementation). C++ supports insertion-order stable maps. If the concern is interoperability with languages or tools that don't preserve order, the cleaner solution is to specify that compliant parsers must use order-preserving mappings, rather than introducing an alternative representation. Example: pals-project/pals-cpp#18 |
|
First of all, I do not accept Claude as an authority on any of this. Especially since answers may be manipulated depending upon the questions. There is a much better way to gauge this. Just ask people if they think they would be confused. |
There are many checks a validator must do. This includes spelling checks, etc. What you show would be a small part of validation. |
|
Yes a validator can check many things, but this misses the point. The difference is that every other validation check catches errors that are inherent to the problem domain: wrong element types, out-of-range parameters, misspelled names. Those errors exist regardless of how you design the schema. The key/list/duplicate check is different: it only exists because the schema introduced a representation that makes it possible. You're not catching a physics mistake, you're catching an artifact of a design choice. A dictionary makes this class of error structurally impossible. The best validation check is the one you don't need to write. |
I agree. But I believe this is being blown all out of proportion. |
Don't worry, I suggested you early on to critically self-review it with an authority of your choice. I am still waiting for a concrete use case example that motivates all this. It is hard to follow at all why this is worth spending time on now. Your only concrete use case is built on the wrong premise that your current C++ YAML lib does not support it, but it in fact does preserve order mapping and you have a bug in |
|
Sorry, clicked the wrong button.
I designed concrete cases above that prevent better support by existing structural serialization/deserialization and falls back for us "writing all the validation (structural and physics/PALS meaning) from scratch". This is not how we do this; for modern standards we want to rely on structural schemas, diff tools, declarative validation schemes, etc.. One builds on the other, we do not mix parsing YAML, doing schema validation, and doing physics mapping into the same levels. I stand behind all problems 1-4 and want to see them either solved or strongly motivated why it is worth to add this. I am repeating myself: for such structural changes, please lead with concrete examples/needs that make this necessary, not a vague "in the future / maybe [no concrete case]". |
jlvay
left a comment
There was a problem hiding this comment.
After finally understanding what this PR was doing and reviewing, I think that the proposed option would bring confusion and additional work for the parsers that is not needed.
|
Decided to close after extensive discussions. |
Clarified that ordered list with unique keys can be used in place of a unordered dictionary.
Rationale: A PALS parser may not always be able to preserve insertion order when it reads in a dict. And it is sometimes desirable to preserve the order. In particular, the C++ parser being developed does not have this capability since it uses an external parsing library that does not have this feature. (*)
Correction added by @ax3l: that premise (*) is wrong. The currently picked library (and many other C++ libs for YAML, JSON, etc.) do in fact support preservation of insertion order. This is a bug in pals-cpp (and here is the fix); it is not a library issue and not a fundamental limitation.