This is the canonical manual for usage, API, selector behavior, performance workflow, conformance expectations, and internals.
- Requirements
- Quick Start
- Core API
- Selector Support
- Mode Guidance
- Performance and Benchmarks
- Latest Benchmark Snapshot
- Conformance Status
- Architecture
- Troubleshooting
Requirements:
- Zig 0.15.2
- Mutable input buffers (`[]u8`) for parsing
const std = @import("std");
const html = @import("htmlparser");
const options: html.ParseOptions = .{};
const Document = options.GetDocument();
test "basic parse + query" {
    var doc = Document.init(std.testing.allocator);
    defer doc.deinit();
    var input = "<div id='app'><a class='nav' href='/docs'>Docs</a></div>".*;
    try doc.parse(&input, .{});
    const a = doc.queryOne("div#app > a.nav") orelse return error.TestUnexpectedResult;
    try std.testing.expectEqualStrings("/docs", a.getAttributeValue("href").?);
}

Source examples:
- `examples/basic_parse_query.zig`
- `examples/query_time_decode.zig`

All examples are verified by running `zig build examples-check`.
- `const opts: ParseOptions = .{};`
- `const Document = opts.GetDocument();`
- `Document.init(allocator)`
- `doc.deinit()`
- `doc.clear()`
- `doc.parse(input: []u8, comptime opts: ParseOptions)`
- Compile-time selectors:
  - `doc.queryOne(comptime selector)`
  - `doc.queryAll(comptime selector)`
- Runtime selectors:
  - `try doc.queryOneRuntime(selector)`
  - `try doc.queryAllRuntime(selector)`
- Cached runtime selectors:
  - `doc.queryOneCached(&selector)`
  - `doc.queryAllCached(&selector)`
  - selector created via `try Selector.compileRuntime(allocator, source)`
- Diagnostics:
  - `doc.queryOneDebug(comptime selector, report)`
  - `try doc.queryOneRuntimeDebug(selector, report)`
- Navigation:
  - `tagName()`
  - `parentNode()`
  - `firstChild()`
  - `lastChild()`
  - `nextSibling()`
  - `prevSibling()`
  - `children()` (borrowed `[]const u32` index view)
- Text:
  - `innerText(allocator)` (borrowed or allocated depending on shape)
  - `innerTextWithOptions(allocator, TextOptions)`
  - `innerTextOwned(allocator)` (always allocated)
  - `innerTextOwnedWithOptions(allocator, TextOptions)`
- Attributes:
  - `getAttributeValue(name)`
- Scoped queries:
  - same query family as `Document` (`queryOne`/`queryAll`, runtime, cached, debug)
- `doc.html()`, `doc.head()`, `doc.body()`
- `doc.isOwned(slice)` to check whether a slice points into document source bytes
- `ParseOptions`
  - `eager_child_views: bool = true`
  - `drop_whitespace_text_nodes: bool = false`
- `TextOptions`
  - `normalize_whitespace: bool = true`
- parse/query work split:
  - parse keeps raw text and attribute spans in-place
  - entity decode and whitespace normalization are applied by query-time APIs (`getAttributeValue`, `innerText*`, selector attribute predicates)
- `parseWithHooks(doc, input, opts, hooks)`
- `queryOneRuntimeWithHooks(doc, selector, hooks)`
- `queryOneCachedWithHooks(doc, selector, hooks)`
- `queryAllRuntimeWithHooks(doc, selector, hooks)`
- `queryAllCachedWithHooks(doc, selector, hooks)`
Supported selectors:
- tag selectors and universal `*`
- `#id`, `.class`
- attributes: `[a]`, `[a=v]`, `[a^=v]`, `[a$=v]`, `[a*=v]`, `[a~=v]`, `[a|=v]`
- combinators:
  - descendant (`a b`)
  - child (`a > b`)
  - adjacent sibling (`a + b`)
  - general sibling (`a ~ b`)
- grouping: `a, b, c`
- pseudo-classes:
  - `:first-child`
  - `:last-child`
  - `:nth-child(An+B)` with `odd`/`even` and forms like `3n+1`, `+3n-2`, `-n+6`
  - `:not(...)` (simple selector payload)
- parser guardrails:
  - multiple `#id` predicates in one compound (for example `#a#b`) are rejected as invalid
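
The `An+B` forms listed above follow the usual CSS rule: a child at 1-based position `index` matches when `index = A*n + B` for some integer `n >= 0`. The sketch below illustrates that arithmetic in plain Zig. `matchesNth` is a hypothetical helper written for this manual, not htmlparser's matcher, and it assumes `index >= 1` and small coefficients.

```zig
const std = @import("std");

/// Hypothetical helper: true when `index` equals a*n + b for some integer n >= 0.
/// Callers are expected to pass a 1-based sibling position (index >= 1).
fn matchesNth(a: i32, b: i32, index: i32) bool {
    // No step: only the single position b can match.
    if (a == 0) return index == b;
    if (a > 0) {
        // n = (index - b) / a must be a non-negative integer.
        return index >= b and @rem(index - b, a) == 0;
    }
    // Negative step: positions count down from b, so n = (b - index) / -a.
    return index <= b and @rem(b - index, -a) == 0;
}

test "nth-child forms" {
    // odd is 2n+1, even is 2n+0
    try std.testing.expect(matchesNth(2, 1, 3));
    try std.testing.expect(!matchesNth(2, 1, 4));
    try std.testing.expect(matchesNth(2, 0, 4));
    // 3n+1 matches positions 1, 4, 7, ...
    try std.testing.expect(matchesNth(3, 1, 7));
    // +3n-2 matches the same positions (n = 1 gives 1, n = 2 gives 4)
    try std.testing.expect(matchesNth(3, -2, 4));
    // -n+6 matches only the first six children
    try std.testing.expect(matchesNth(-1, 6, 6));
    try std.testing.expect(!matchesNth(-1, 6, 7));
}
```

Both operands of `@rem` stay non-negative here, which keeps the remainder check independent of sign conventions.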
Compilation modes:
- comptime selectors fail at compile time when invalid
- runtime selectors return `error.InvalidSelector`
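
As a rough illustration of the runtime path, the test below feeds `queryOneRuntime` the `#a#b` selector that this manual lists as invalid and expects `error.InvalidSelector`. It reuses the Quick Start setup and assumes that `queryOneRuntime` accepts a string slice and returns an error union, as the `try` in the API list suggests.

```zig
const std = @import("std");
const html = @import("htmlparser");

const options: html.ParseOptions = .{};
const Document = options.GetDocument();

test "invalid runtime selector surfaces as an error" {
    var doc = Document.init(std.testing.allocator);
    defer doc.deinit();

    var input = "<div id='a'></div>".*;
    try doc.parse(&input, .{});

    // Two #id predicates in one compound are documented as invalid.
    try std.testing.expectError(error.InvalidSelector, doc.queryOneRuntime("#a#b"));
}
```

The same selector passed to `doc.queryOne` would instead fail the build, since comptime selectors are validated at compile time.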
htmlparser is permissive by design. Choose parse options by workload:
| Mode | Parse Options | Best For | Tradeoffs |
|---|---|---|---|
| `strictest` | `.eager_child_views = true, .drop_whitespace_text_nodes = false` | traversal predictability and text fidelity | higher parse-time work |
| `fastest` | `.eager_child_views = false, .drop_whitespace_text_nodes = true` | throughput-first scraping | whitespace-only text nodes dropped; child views built lazily |
Fallback playbook:
- Start with `fastest` for bulk workloads.
- Move unstable domains to `strictest`.
- Use `queryOneRuntimeDebug` and `QueryDebugReport` before changing selectors.
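
A minimal sketch of the `fastest` row, following the Quick Start pattern. It assumes the same `ParseOptions` value can be used both to derive the `Document` type and as the `comptime opts` argument of `doc.parse`; the name `fastest` is only a local constant chosen for this example.

```zig
const std = @import("std");
const html = @import("htmlparser");

// Throughput-first settings from the `fastest` row of the table.
const fastest: html.ParseOptions = .{
    .eager_child_views = false,
    .drop_whitespace_text_nodes = true,
};
const Document = fastest.GetDocument();

test "parse with fastest options" {
    var doc = Document.init(std.testing.allocator);
    defer doc.deinit();

    // Whitespace-only text nodes between the tags are dropped in this mode.
    var input = "<ul>\n  <li>one</li>\n  <li>two</li>\n</ul>".*;
    try doc.parse(&input, fastest);

    _ = doc.queryOne("ul > li") orelse return error.TestUnexpectedResult;
}
```

Switching a problematic domain to `strictest` would then only mean flipping the two fields back to their defaults.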
Run benchmarks:
zig build bench-compare
zig build tools -- run-benchmarks --profile quick
zig build tools -- run-benchmarks --profile stable

Artifacts:
- `bench/results/latest.md`
- `bench/results/latest.json`
Benchmark policy:
- parse comparisons include `strlen`, `lexbor`, and parse-only `lol-html`
- query parse/match/cached sections benchmark `htmlparser`
- repeated runtime selector workloads should use cached selectors
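
The cached-selector advice above could look roughly like this: compile the selector once, then reuse it for every document. This is a sketch under assumptions. It assumes `Selector` is reachable as `html.Selector` and that a compiled selector can be reused across documents. An arena is used so the example does not depend on how a compiled selector is released.

```zig
const std = @import("std");
const html = @import("htmlparser");

const options: html.ParseOptions = .{};
const Document = options.GetDocument();

test "compile once, query many documents" {
    var arena = std.heap.ArenaAllocator.init(std.testing.allocator);
    defer arena.deinit();

    // Assumption: `Selector` is exported from the package root.
    var selector = try html.Selector.compileRuntime(arena.allocator(), "a.nav");

    // Parsing needs mutable buffers, so each page is a mutable array.
    var page_a = "<a class='nav' href='/docs'>Docs</a>".*;
    var page_b = "<p><a class='nav' href='/api'>API</a></p>".*;
    const pages = [_][]u8{ &page_a, &page_b };

    for (pages) |page| {
        var doc = Document.init(std.testing.allocator);
        defer doc.deinit();
        try doc.parse(page, .{});

        // No selector parsing happens here; only matching.
        const a = doc.queryOneCached(&selector) orelse return error.TestUnexpectedResult;
        try std.testing.expect(a.getAttributeValue("href") != null);
    }
}
```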
Warning: throughput numbers are not conformance claims. This parser is permissive by design; see Conformance Status.
Source: bench/results/latest.json (stable profile).
| Fixture | ours | lol-html | lexbor |
|---|---|---|---|
| `rust-lang.html` | 1447.99 | 1474.65 | 332.72 |
| `wiki-html.html` | 1645.45 | 1215.04 | 271.24 |
| `mdn-html.html` | 2570.09 | 1879.00 | 404.50 |
| `w3-html52.html` | 1064.19 | 764.62 | 199.22 |
| `hn.html` | 1263.60 | 885.26 | 223.15 |
| `python-org.html` | 1549.02 | 1356.21 | 284.19 |
| `kernel-org.html` | 1440.47 | 1300.81 | 276.52 |
| `gnu-org.html` | 1917.36 | 1482.15 | 317.74 |
| `ziglang-org.html` | 1480.49 | 1257.62 | 291.72 |
| `ziglang-doc-master.html` | 1122.44 | 987.16 | 214.23 |
| `wikipedia-unicode-list.html` | 1247.00 | 1024.98 | 215.21 |
| `whatwg-html-spec.html` | 1113.73 | 841.16 | 210.83 |
| `synthetic-forms.html` | 1046.17 | 710.72 | 174.94 |
| `synthetic-table-grid.html` | 768.56 | 622.31 | 152.86 |
| `synthetic-list-nested.html` | 833.77 | 598.02 | 152.45 |
| `synthetic-comments-doctype.html` | 1200.72 | 827.66 | 212.09 |
| `synthetic-template-rich.html` | 628.02 | 444.34 | 134.10 |
| `synthetic-whitespace-noise.html` | 1104.21 | 919.69 | 170.33 |
| `synthetic-news-feed.html` | 835.27 | 577.95 | 144.46 |
| `synthetic-ecommerce.html` | 787.72 | 556.51 | 151.95 |
| `synthetic-forum-thread.html` | 839.48 | 579.84 | 143.06 |

| Case | ours ops/s | ours ns/op |
|---|---|---|
| `attr-heavy-button` | 1148936.76 | 870.37 |
| `attr-heavy-nav` | 1130790.00 | 884.34 |

| Case | ours ops/s | ours ns/op |
|---|---|---|
| `attr-heavy-button` | 1305257.78 | 766.13 |
| `attr-heavy-nav` | 1347173.46 | 742.29 |

| Selector case | Ops/s | ns/op |
|---|---|---|
| `simple` | 17335919.85 | 57.68 |
| `complex` | 5836657.49 | 171.33 |
| `grouped` | 6396371.26 | 156.34 |

For full per-parser, per-fixture tables and gate output:
- `bench/results/latest.md`
- `bench/results/latest.json`
Run conformance suites:
zig build conformance
# or
zig build tools -- run-external-suites --mode both

Artifact: `bench/results/external_suite_report.json`
Tracked suites:
- selector suites: `nwmatcher`, `qwery_contextual`
- parser suites:
  - html5lib tree-construction subset
  - WHATWG HTML parsing corpus (via WPT `html/syntax/parsing/html5lib_*.html`)
Fetched suite repos are cached under bench/.cache/suites/ (gitignored).
Core modules:
- `src/html/parser.zig`: permissive parse pipeline
- `src/html/scanner.zig`: byte-scanning hot-path helpers
- `src/html/tags.zig`: tag metadata and hash dispatch
- `src/html/attr_inline.zig`: in-place attribute traversal/lazy materialization
- `src/html/entities.zig`: entity decode utilities
- `src/selector/runtime.zig`, `src/selector/compile_time.zig`: selector parsing
- `src/selector/matcher.zig`: selector matching/combinator traversal

Data model highlights:
- `Document` owns source bytes and node/index storage
- nodes are contiguous and linked by indexes for traversal
- attributes are traversed directly from source spans (no heap attribute objects)

- validate selector syntax (`queryOneRuntime` can return `error.InvalidSelector`)
- check scope (`Document` vs scoped `Node`)
- use `queryOneRuntimeDebug` and inspect `QueryDebugReport`

- default `innerText` normalizes whitespace
- use `innerTextWithOptions(..., .{ .normalize_whitespace = false })` for raw spacing
- use `innerTextOwned(...)` when output must always be allocated
- use `doc.isOwned(slice)` to check borrowed vs allocated
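
The text options above might be exercised as in the following sketch. It assumes the `innerText*` functions return an error union over a byte slice (they take an allocator, but this manual does not show their return types) and that `doc.isOwned` returns a value that can be printed. No exact output is asserted, because the precise normalization rules are not spelled out here.

```zig
const std = @import("std");
const html = @import("htmlparser");

const options: html.ParseOptions = .{};
const Document = options.GetDocument();

test "raw vs normalized inner text" {
    // The result may be borrowed or allocated, so an arena releases everything at once.
    var arena = std.heap.ArenaAllocator.init(std.testing.allocator);
    defer arena.deinit();
    const alloc = arena.allocator();

    var doc = Document.init(std.testing.allocator);
    defer doc.deinit();

    var input = "<p>  Hello \n   world  </p>".*;
    try doc.parse(&input, .{});

    const p = doc.queryOne("p") orelse return error.TestUnexpectedResult;

    // Read the raw spacing first, since query-time paths may rewrite source bytes in place.
    const raw = try p.innerTextWithOptions(alloc, .{ .normalize_whitespace = false });
    std.debug.print("raw: '{s}' (points into source: {any})\n", .{ raw, doc.isOwned(raw) });

    // Default behavior: whitespace is normalized.
    const normalized = try p.innerText(alloc);
    std.debug.print("normalized: '{s}'\n", .{normalized});
}
```

When the caller needs a slice that outlives the document, `innerTextOwned` is the variant documented as always allocating.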
queryAllRuntime iterators are invalidated by newer queryAllRuntime calls on the same Document.
Expected: parse and lazy decode paths mutate source bytes in place.