Merged
96 changes: 96 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,96 @@
# Contributing to htmlparser

## Adding new elements

When adding new elements to the parser, you must regenerate the element name hash tables in `src/nu/validator/htmlparser/impl/ElementName.java`.

### Step 1: Add the new element constant

Add a new `static final ElementName` constant for your element, following the existing pattern:

```java
public static final ElementName MYNEWELEMENT = new ElementName(
        "mynewelement", "mynewelement",
        // CPPONLY: NS_NewHTMLElement,
        // CPPONLY: NS_NewSVGUnknownElement,
        TreeBuilder.OTHER);
```

The flags (like `TreeBuilder.OTHER`, `SPECIAL`, `SCOPING`, etc.) depend on how the element should be handled by the tree builder.

### Step 2: Uncomment the code generation sections

Uncomment three sections in `ElementName.java`:

1. **The imports** near the top (~lines 26-39):
   - `java.io.*`
   - `java.util.*`
   - `java.util.regex.*`

2. **`implements Comparable<ElementName>`** on the class declaration (~line 49)

3. **The code generation block** marked with:
   `"START CODE ONLY USED FOR GENERATING CODE uncomment and run to regenerate"`
   This block includes the `main()` method and helper functions (~lines 272-659)

### Step 3: Add case to treeBuilderGroupToName() if needed

If your element uses a new `TreeBuilder` group constant, add a case for it in the `treeBuilderGroupToName()` method within the code generation block.
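As a rough illustration of what such a case looks like, here is a minimal, self-contained sketch. It assumes the method maps an `int` group constant to its name as a `String`; that signature, the constants `OTHER` and `MYNEWGROUP`, and the exception in the `default` branch are stand-ins for this example and are not copied from `ElementName.java`. Check the real method before editing it.

```java
public final class GroupNameSketch {
    // Hypothetical stand-ins for TreeBuilder group constants.
    // The real constants live in TreeBuilder; these values are illustrative only.
    static final int OTHER = 0;
    static final int MYNEWGROUP = 1;

    // Assumed shape: translate a group constant into the identifier that the
    // code generator prints in the regenerated constant definitions.
    static String treeBuilderGroupToName(int group) {
        switch (group) {
            case OTHER:
                return "OTHER";
            case MYNEWGROUP: // the newly added case
                return "MYNEWGROUP";
            default:
                // Choice made for this sketch: fail loudly on an unmapped group.
                throw new IllegalArgumentException("Unmapped group: " + group);
        }
    }

    public static void main(String[] args) {
        System.out.println(treeBuilderGroupToName(MYNEWGROUP));
    }
}
```

Without the new case, the generator would have no name to emit for elements that use the new group.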

### Step 4: Compile and run

Compile the project:

```bash
mvn compile
```

Run the `ElementName` class with paths to the Gecko tag-list files:

```bash
java -cp target/classes nu.validator.htmlparser.impl.ElementName \
  /path/to/nsHTMLTagList.h \
  /path/to/SVGTagList.h
```

**For Java-only builds** (not Gecko), you can use empty dummy files:

```bash
mkdir -p /tmp/tagfiles
touch /tmp/tagfiles/nsHTMLTagList.h /tmp/tagfiles/SVGTagList.h
java -cp target/classes nu.validator.htmlparser.impl.ElementName \
  /tmp/tagfiles/nsHTMLTagList.h \
  /tmp/tagfiles/SVGTagList.h
```

> [!NOTE]
> Using empty files means the `CPPONLY` comments will all show `NS_NewHTMLUnknownElement`. For Gecko builds, use the actual files from mozilla-central:
> - `parser/htmlparser/nsHTMLTagList.h`
> - `dom/svg/SVGTagList.h`

### Step 5: Update the generated arrays

The program outputs:
1. All element constant definitions (with updated `CPPONLY` comments if using real Gecko tag files)
2. The `ELEMENT_NAMES` array in level-order binary search tree order
3. The `ELEMENT_HASHES` array with corresponding hash values

Replace the existing `ELEMENT_NAMES` and `ELEMENT_HASHES` arrays in the file with the generated output. The arrays must stay in sync: the element at position N in `ELEMENT_NAMES` must have its hash at position N in `ELEMENT_HASHES`.
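A quick sanity check after pasting the arrays can catch a partial or mismatched paste. The sketch below is not part of the project; the class and method names are hypothetical. It checks only two things: that the arrays have the same length, and that the hash array is a valid level-order binary search tree (an in-order walk yields strictly increasing values). It cannot confirm that each name is paired with its own hash, which would require the real hash function.

```java
public final class LevelOrderCheck {

    // True if the array, read as a level-order (heap-style) binary tree with
    // children at 2i+1 and 2i+2, is a binary search tree without duplicates.
    static boolean isLevelOrderBst(int[] hashes) {
        long[] previous = {Long.MIN_VALUE}; // last value seen by the in-order walk
        return inOrder(hashes, 0, previous);
    }

    private static boolean inOrder(int[] hashes, int i, long[] previous) {
        if (i >= hashes.length) {
            return true; // empty subtree
        }
        if (!inOrder(hashes, 2 * i + 1, previous)) {
            return false;
        }
        if (hashes[i] <= previous[0]) {
            return false; // out of order, or a duplicate hash
        }
        previous[0] = hashes[i];
        return inOrder(hashes, 2 * i + 2, previous);
    }

    // Hypothetical helper: the two arrays have matching lengths and the
    // hash array has the expected ordering.
    static boolean looksConsistent(String[] names, int[] hashes) {
        return names.length == hashes.length && isLevelOrderBst(hashes);
    }

    public static void main(String[] args) {
        String[] names = {"d", "b", "f", "a", "c", "e", "g"}; // made-up data
        int[] hashes = {40, 20, 60, 10, 30, 50, 70};
        System.out.println(looksConsistent(names, hashes));
    }
}
```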

### Step 6: Re-comment the code generation sections

After regeneration, comment out the sections you uncommented in Step 2 to restore the file to its normal state.

### Step 7: Run tests

Verify your changes work correctly:

```bash
mvn test
```

### Technical Details

The hash function (`bufToHash`) computes an integer for each element name from the name's length and the characters at specific positions; these integers are expected to be unique across all element names. The arrays are laid out as a level-order binary search tree, which gives O(log n) lookup.
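To make the layout concrete, the following sketch searches a level-order binary search tree stored in a flat array. It assumes the usual heap-style indexing, where the children of index `i` sit at `2i + 1` and `2i + 2`; the class name, method name, and sample data are invented for this example and are not taken from the parser.

```java
public final class LevelOrderLookup {
    // Made-up sample data. The sorted hashes 10..70 are stored so that
    // index 0 is the root and the children of i are at 2i+1 and 2i+2.
    static final int[] HASHES = {40, 20, 60, 10, 30, 50, 70};
    static final String[] NAMES = {"d", "b", "f", "a", "c", "e", "g"};

    // Returns the index of hash in the level-order array, or -1 if absent.
    static int find(int[] hashes, int hash) {
        int i = 0;
        while (i < hashes.length) {
            int current = hashes[i];
            if (hash < current) {
                i = 2 * i + 1; // descend into the left subtree
            } else if (hash > current) {
                i = 2 * i + 2; // descend into the right subtree
            } else {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int index = find(HASHES, 50);
        // The same index selects the matching entry in the names array.
        System.out.println(index >= 0 ? NAMES[index] : "not found");
    }
}
```

Because the index found in the hash array is reused to read the name array, a lookup is only correct when both arrays are in the same order.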

If you encounter a hash collision (two elements with the same hash), the regeneration will report an error. That would require modifying the hash function, which has not been necessary historically.
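The idea behind collision detection can be sketched as follows. The `toyHash` function here is a deliberately simple stand-in that mixes the name's length with its first and last characters; it is not `bufToHash`, and the class and method names are hypothetical. The sketch assumes non-empty ASCII names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class CollisionCheck {

    // Stand-in hash, NOT the parser's bufToHash: combines the length with
    // the first and last characters of a non-empty ASCII name.
    static int toyHash(String name) {
        int len = name.length();
        return (len << 16) | (name.charAt(0) << 8) | name.charAt(len - 1);
    }

    // Returns a description of the first colliding pair, or null if every
    // distinct name maps to a distinct hash.
    static String findCollision(List<String> names) {
        Map<Integer, String> seen = new HashMap<>();
        for (String name : names) {
            String previous = seen.put(toyHash(name), name);
            if (previous != null && !previous.equals(name)) {
                return previous + " / " + name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // "div" and "dov" share length, first and last character under toyHash.
        System.out.println(findCollision(Arrays.asList("div", "span", "dov")));
    }
}
```

A collision of this kind means two names would occupy the same slot in the search tree, which is why the hash function itself would have to change.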
2 changes: 1 addition & 1 deletion test-src/test/resources/html5lib-tests
Submodule html5lib-tests updated 60 files
+76 −0 .github/workflows/downstream.yml
+25 −0 .github/workflows/lint.yml
+79 −0 .gitignore
+6 −0 lint
+0 −0 lint_lib/__init__.py
+24 −0 lint_lib/_vendor-patches/funcparserlib.patch
+0 −0 lint_lib/_vendor/__init__.py
+18 −0 lint_lib/_vendor/funcparserlib/LICENSE
+0 −0 lint_lib/_vendor/funcparserlib/__init__.py
+211 −0 lint_lib/_vendor/funcparserlib/lexer.py
+34 −0 lint_lib/_vendor/funcparserlib/lexer.pyi
+872 −0 lint_lib/_vendor/funcparserlib/parser.py
+83 −0 lint_lib/_vendor/funcparserlib/parser.pyi
+0 −0 lint_lib/_vendor/funcparserlib/py.typed
+72 −0 lint_lib/_vendor/funcparserlib/util.py
+7 −0 lint_lib/_vendor/funcparserlib/util.pyi
+1 −0 lint_lib/_vendor/vendor.txt
+280 −0 lint_lib/lint.py
+177 −0 lint_lib/parser.py
+7 −0 pyproject.toml
+2 −2 serializer/core.test
+6 −1 tokenizer/domjs.test
+4 −0 tokenizer/test1.test
+1 −1 tokenizer/test2.test
+8 −8 tokenizer/test3.test
+0 −24 tree-construction/blocks.dat
+4 −11 tree-construction/comments01.dat
+5 −1 tree-construction/doctype01.dat
+2 −2 tree-construction/entities02.dat
+109 −23 tree-construction/foreign-fragment.dat
+0 −1 tree-construction/html5test-com.dat
+23 −0 tree-construction/math.dat
+2 −18 tree-construction/menuitem-element.dat
+6 −0 tree-construction/namespace-sensitivity.dat
+ tree-construction/plain-text-unsafe.dat
+53 −0 tree-construction/quirks01.dat
+1 −0 tree-construction/ruby.dat
+0 −13 tree-construction/scriptdata01.dat
+46 −0 tree-construction/search-element.dat
+23 −0 tree-construction/svg.dat
+51 −14 tree-construction/tables01.dat
+86 −17 tree-construction/template.dat
+15 −45 tree-construction/tests1.dat
+25 −21 tree-construction/tests10.dat
+0 −2 tree-construction/tests16.dat
+4 −6 tree-construction/tests17.dat
+81 −43 tree-construction/tests18.dat
+0 −56 tree-construction/tests19.dat
+11 −1 tree-construction/tests2.dat
+265 −5 tree-construction/tests20.dat
+6 −25 tree-construction/tests21.dat
+60 −0 tree-construction/tests26.dat
+16 −0 tree-construction/tests4.dat
+1 −1 tree-construction/tests6.dat
+41 −6 tree-construction/tests7.dat
+3 −0 tree-construction/tests8.dat
+31 −21 tree-construction/tests9.dat
+5 −48 tree-construction/tests_innerHTML_1.dat
+45 −17 tree-construction/webkit01.dat
+472 −0 tree-construction/webkit02.dat