MDEV-18359, MDEV-26905: Fix invalid XML in charsets Index.xml #4873
Draft
FaramosCZ wants to merge 1 commit into MariaDB:main
Summary:
The charset definition files sql/share/charsets/Index.xml and
mysql-test/std_data/ldml/Index.xml contained duplicate "flag" attributes
on single <collation> elements, violating XML well-formedness rules.
Standard XML parsers (xmllint, libxml2, etc.) reject duplicate attributes,
making these files unparseable by any spec-compliant tool.
Root Cause:
When nopad_bin collations were added, their flags were specified as
XML attributes: flag="binary" flag="nopad". The XML specification
(Section 3.1, Well-Formedness Constraint: Unique Att Spec) prohibits
duplicate attribute names on a single element. MariaDB's custom XML
parser in strings/xml.c happened to process both duplicates because
it handles attributes sequentially in a while loop, but this is
non-standard behavior that breaks interoperability with standard
XML tooling.
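The rejection by standard parsers can be reproduced with a few lines of Python. This is a minimal sketch, not part of the patch: it uses the expat-backed xml.etree.ElementTree module from the standard library as a stand-in for "a spec-compliant parser", and the helper name is_well_formed is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The attribute pattern described above, and a variant with a single flag.
DUPLICATE = '<collation name="latin2_nopad_bin" id="1101" flag="binary" flag="nopad"/>'
SINGLE = '<collation name="latin2_nopad_bin" id="1101" flag="binary"/>'


def is_well_formed(xml_text):
    """Return True if the standard-library parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # expat reports a repeated attribute name as a parse error
        return False


print(is_well_formed(DUPLICATE))  # False
print(is_well_formed(SINGLE))     # True
```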
What the patch does:
Converts all 24 self-closing <collation> elements that carried
duplicate flag attributes into elements with child <flag> nodes.
This follows the existing pattern already used
by many collations in the same file (e.g., big5_chinese_ci,
latin1_swedish_ci, utf8mb3_general_ci).
Before (not well-formed XML):
<collation name="latin2_nopad_bin" id="1101" flag="binary" flag="nopad"/>
After (well-formed XML):
<collation name="latin2_nopad_bin" id="1101">
  <flag>binary</flag>
  <flag>nopad</flag>
</collation>
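The same transformation can be sketched as a small text rewrite. This is an illustration only and not necessarily how the patch was produced: the function name convert_duplicate_flags and both regular expressions are assumptions, and the sketch handles only double-quoted attributes on self-closing <collation> elements.

```python
import re

# Self-closing <collation .../> elements; group 1 captures the attribute text.
COLLATION_RE = re.compile(r'<collation\b([^<>]*?)\s*/>')
# One flag="..." attribute together with its leading whitespace.
FLAG_ATTR_RE = re.compile(r'\s+flag="([^"]*)"')


def convert_duplicate_flags(text):
    """Rewrite collations with repeated flag attributes to child <flag> nodes."""
    def rewrite(match):
        attrs = match.group(1)
        flags = FLAG_ATTR_RE.findall(attrs)
        if len(flags) < 2:
            return match.group(0)  # zero or one flag attribute: leave untouched
        kept = FLAG_ATTR_RE.sub("", attrs).rstrip()
        children = "".join(f"\n  <flag>{flag}</flag>" for flag in flags)
        return f"<collation{kept}>{children}\n</collation>"

    return COLLATION_RE.sub(rewrite, text)


before = '<collation name="latin2_nopad_bin" id="1101" flag="binary" flag="nopad"/>'
print(convert_duplicate_flags(before))
```

Elements with at most one flag attribute pass through unchanged, so the rewrite touches only the lines that violate the uniqueness constraint.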
No C code changes are required. The _CS_FLAG handler in
strings/ctype.c (around line 621) already processes <flag> child
elements using bitwise OR (|=) to accumulate flags, so both "binary"
(MY_CS_BINSORT) and "nopad" (MY_CS_NOPAD) flags are correctly applied.
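The accumulation described above can be modelled in a few lines. This is a Python sketch of the idea, not the C code in strings/ctype.c: the bit values assigned to MY_CS_BINSORT and MY_CS_NOPAD are placeholders chosen for illustration, and the names FLAG_BITS and accumulate_flags are hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder bit values for illustration only; the real constants are
# defined in MariaDB's C headers and may differ.
MY_CS_BINSORT = 1 << 4
MY_CS_NOPAD = 1 << 17

FLAG_BITS = {"binary": MY_CS_BINSORT, "nopad": MY_CS_NOPAD}


def accumulate_flags(flag_values):
    """OR together the bit for each recognised <flag> value."""
    state = 0
    for value in flag_values:
        state |= FLAG_BITS.get(value, 0)  # unknown values contribute nothing
    return state


AFTER = (
    '<collation name="latin2_nopad_bin" id="1101">'
    '<flag>binary</flag><flag>nopad</flag>'
    '</collation>'
)
element = ET.fromstring(AFTER)
state = accumulate_flags(child.text for child in element.findall("flag"))
print(state == MY_CS_BINSORT | MY_CS_NOPAD)  # True
```

Because each child element only sets bits, the order of the <flag> nodes does not affect the result.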
Files modified:
- sql/share/charsets/Index.xml (23 collations fixed)
- mysql-test/std_data/ldml/Index.xml (1 collation fixed)
Complete list of 24 collations fixed:
sql/share/charsets/Index.xml:
1. latin2_nopad_bin (id=1101)
2. dec8_nopad_bin (id=1093)
3. cp850_nopad_bin (id=1104)
4. hp8_nopad_bin (id=1096)
5. koi8r_nopad_bin (id=1098)
6. swe7_nopad_bin (id=1106)
7. ascii_nopad_bin (id=1089)
8. cp1251_nopad_bin (id=1074)
9. hebrew_nopad_bin (id=1095)
10. latin7_nopad_bin (id=1103)
11. koi8u_nopad_bin (id=1099)
12. greek_nopad_bin (id=1094)
13. cp1250_nopad_bin (id=1090)
14. cp1257_nopad_bin (id=1082)
15. latin5_nopad_bin (id=1102)
16. armscii8_nopad_bin (id=1088)
17. cp866_nopad_bin (id=1092)
18. keybcs2_nopad_bin (id=1097)
19. macce_nopad_bin (id=1067)
20. macroman_nopad_bin (id=1077)
21. cp852_nopad_bin (id=1105)
22. cp1256_nopad_bin (id=1091)
23. geostd8_nopad_bin (id=1117)
mysql-test/std_data/ldml/Index.xml:
24. ascii2_nopad_bin (id=325)
Validation:
- xmllint --noout passes cleanly on both files after the fix
- Zero duplicate flag attributes remain (verified with grep)
- The fix is consistent with the existing pattern used by other
collations in the same files
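A check for remaining duplicates can also be scripted. The following is a rough sketch under stated assumptions rather than the grep command that was actually run: find_duplicate_attrs is a hypothetical helper, and its regular expressions cover only double-quoted attributes and tags whose attribute values contain no angle brackets.

```python
import re
from collections import Counter

# Opening or self-closing tags (closing tags start with "</" and are skipped).
TAG_RE = re.compile(r'<[A-Za-z_][^<>]*>')
# Attribute name followed by a double-quoted value.
ATTR_RE = re.compile(r'([A-Za-z_:][\w:.-]*)\s*=\s*"[^"]*"')


def find_duplicate_attrs(text):
    """Return (line_number, attribute_name) pairs for repeated attributes."""
    hits = []
    for match in TAG_RE.finditer(text):
        counts = Counter(ATTR_RE.findall(match.group(0)))
        line = text.count("\n", 0, match.start()) + 1
        hits.extend((line, name) for name, n in counts.items() if n > 1)
    return hits


sample = (
    '<charsets>\n'
    '<collation name="a" id="1" flag="binary" flag="nopad"/>\n'
    '<collation name="b" id="2"><flag>binary</flag></collation>\n'
    '</charsets>\n'
)
print(find_duplicate_attrs(sample))  # [(2, 'flag')]
```

Running such a helper over the contents of both Index.xml files should return an empty list once the fix is applied.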
Co-Authored-By: Claude AI <noreply@anthropic.com>