
MDEV-18359, MDEV-26905: Fix invalid XML in charsets Index.xml #4873

Draft
FaramosCZ wants to merge 1 commit into MariaDB:main from FaramosCZ:MDEV-18359+MDEV-26905



@FaramosCZ force-pushed the MDEV-18359+MDEV-26905 branch from ef6e5ee to 42b7f8a on March 27, 2026 12:28
@gkodinov added the External Contribution label (All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements.) on Mar 27, 2026
@FaramosCZ force-pushed the MDEV-18359+MDEV-26905 branch 2 times, most recently from e687670 to 47aafdf on March 28, 2026 08:57
Summary:
The charset definition files sql/share/charsets/Index.xml and
mysql-test/std_data/ldml/Index.xml contained duplicate "flag" attributes
on single <collation> elements, violating XML well-formedness rules.
Standard XML parsers (xmllint, libxml2, etc.) reject duplicate attributes,
making these files unparseable by any spec-compliant tool.

Root Cause:
When nopad_bin collations were added, their flags were specified as
XML attributes: flag="binary" flag="nopad". The XML specification
(Section 3.1, Well-Formedness Constraint: Unique Att Spec) prohibits
duplicate attribute names on a single element. MariaDB's custom XML
parser in strings/xml.c happened to process both duplicates because
it handles attributes sequentially in a while loop, but this is
non-standard behavior that breaks interoperability with standard
XML tooling.
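
A minimal sketch of the interoperability problem described above, assuming Python's standard `xml.etree.ElementTree` (backed by expat) stands in for a spec-compliant parser. The helper name `is_well_formed` is hypothetical; the element strings mirror the latin2_nopad_bin entry:

```python
import xml.etree.ElementTree as ET

# Attribute form used before the patch: "flag" appears twice on one element.
DUPLICATE_ATTRS = (
    '<collation name="latin2_nopad_bin" id="1101" flag="binary" flag="nopad"/>'
)

# Child-element form: each flag is its own <flag> node.
CHILD_FLAGS = (
    '<collation name="latin2_nopad_bin" id="1101">'
    "<flag>binary</flag><flag>nopad</flag>"
    "</collation>"
)


def is_well_formed(xml_text):
    """Return True if the expat-based standard parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        # expat reports the repeated attribute as a parse error.
        return False
    return True
```

With these inputs, `is_well_formed(DUPLICATE_ATTRS)` is False and `is_well_formed(CHILD_FLAGS)` is True.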

What the patch does:
Converts all 24 self-closing <collation> elements that carried
duplicate flag attributes into elements with child <flag> nodes.
This follows the existing pattern already used by many collations
in the same file (e.g., big5_chinese_ci, latin1_swedish_ci,
utf8mb3_general_ci).

Before (not well-formed XML):
  <collation name="latin2_nopad_bin" id="1101" flag="binary" flag="nopad"/>

After (well-formed XML):
  <collation name="latin2_nopad_bin" id="1101">
    <flag>binary</flag>
    <flag>nopad</flag>
  </collation>
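
The rewrite shown above could be scripted roughly as follows. This is only a sketch: the source does not say how the edit was actually made, the function name `expand_duplicate_flags` is hypothetical, and a regex is used because a standard XML parser cannot read the original input at all:

```python
import re

# A self-closing <collation .../> element, with its leading indentation
# and raw attribute text captured separately.
SELF_CLOSING = re.compile(r"(?P<indent>[ \t]*)<collation(?P<attrs>[^<>]*?)/>")
FLAG_ATTR = re.compile(r'\s+flag="([^"]*)"')


def expand_duplicate_flags(xml_text):
    """Rewrite collations with more than one flag attribute into child-<flag> form."""

    def rewrite(match):
        attrs = match.group("attrs")
        flags = FLAG_ATTR.findall(attrs)
        if len(flags) < 2:
            # A single flag attribute is already well-formed; leave it alone.
            return match.group(0)
        indent = match.group("indent")
        kept = FLAG_ATTR.sub("", attrs).rstrip()
        children = "".join(f"{indent}  <flag>{flag}</flag>\n" for flag in flags)
        return f"{indent}<collation{kept}>\n{children}{indent}</collation>"

    return SELF_CLOSING.sub(rewrite, xml_text)
```

Elements with zero or one flag attribute pass through unchanged, so running the function over a whole file would touch only the duplicated entries.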

No C code changes are required. The _CS_FLAG handler in
strings/ctype.c (around line 621) already processes <flag> child
elements using bitwise OR (|=) to accumulate flags, so both "binary"
(MY_CS_BINSORT) and "nopad" (MY_CS_NOPAD) flags are correctly applied.
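
The accumulation described above can be illustrated in a few lines. This is a Python sketch of the idea, not the C handler itself: the bit values are placeholders rather than the real MY_CS_* definitions, and `collation_flags` is a hypothetical name:

```python
import xml.etree.ElementTree as ET

# Placeholder bit values for illustration only; the real MY_CS_* constants
# live in the MariaDB headers and are not reproduced here.
MY_CS_BINSORT = 1 << 0
MY_CS_NOPAD = 1 << 1

FLAG_BITS = {"binary": MY_CS_BINSORT, "nopad": MY_CS_NOPAD}


def collation_flags(collation_xml):
    """OR together the bits named by each <flag> child of a <collation> element."""
    state = 0
    for flag in ET.fromstring(collation_xml).findall("flag"):
        # Each child contributes one bit; |= keeps the bits set by earlier children.
        state |= FLAG_BITS[(flag.text or "").strip()]
    return state
```

Because each child only adds bits, the order of the <flag> elements does not affect the result.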

Files modified:
- sql/share/charsets/Index.xml (23 collations fixed)
- mysql-test/std_data/ldml/Index.xml (1 collation fixed)

Complete list of 24 collations fixed:

sql/share/charsets/Index.xml:
 1. latin2_nopad_bin     (id=1101)
 2. dec8_nopad_bin       (id=1093)
 3. cp850_nopad_bin      (id=1104)
 4. hp8_nopad_bin        (id=1096)
 5. koi8r_nopad_bin      (id=1098)
 6. swe7_nopad_bin       (id=1106)
 7. ascii_nopad_bin      (id=1089)
 8. cp1251_nopad_bin     (id=1074)
 9. hebrew_nopad_bin     (id=1095)
10. latin7_nopad_bin     (id=1103)
11. koi8u_nopad_bin      (id=1099)
12. greek_nopad_bin      (id=1094)
13. cp1250_nopad_bin     (id=1090)
14. cp1257_nopad_bin     (id=1082)
15. latin5_nopad_bin     (id=1102)
16. armscii8_nopad_bin   (id=1088)
17. cp866_nopad_bin      (id=1092)
18. keybcs2_nopad_bin    (id=1097)
19. macce_nopad_bin      (id=1067)
20. macroman_nopad_bin   (id=1077)
21. cp852_nopad_bin      (id=1105)
22. cp1256_nopad_bin     (id=1091)
23. geostd8_nopad_bin    (id=1117)

mysql-test/std_data/ldml/Index.xml:
24. ascii2_nopad_bin     (id=325)

Validation:
- xmllint --noout passes cleanly on both files after the fix
- Zero duplicate flag attributes remain (verified with grep)
- The fix is consistent with the existing pattern used by other
  collations in the same files
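
A rough Python equivalent of the two checks listed above, for readers without xmllint at hand. The function name `validate_index_xml` and the regex are assumptions made for this sketch; the exact grep pattern used for the PR is not given in the source:

```python
import re
import xml.etree.ElementTree as ET

# A single tag whose attribute text contains flag="..." at least twice.
DUPLICATE_FLAG = re.compile(
    r'<[^<>]*\sflag="[^"]*"[^<>]*\sflag="[^"]*"[^<>]*>'
)


def validate_index_xml(xml_text):
    """Return (well_formed, duplicate_count) for the text of an Index.xml file."""
    duplicate_count = len(DUPLICATE_FLAG.findall(xml_text))
    try:
        ET.fromstring(xml_text)
        well_formed = True
    except ET.ParseError:
        well_formed = False
    return well_formed, duplicate_count
```

A file that passes both checks yields `(True, 0)`.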

Co-Authored-By: Claude AI <noreply@anthropic.com>

@FaramosCZ force-pushed the MDEV-18359+MDEV-26905 branch from 47aafdf to f6111fd on March 28, 2026 15:55