
Commit 39508b5

v0.2.0: SIMD-accelerated XML parser for Python
Tag name interning, lazy ElementList, zero-copy parse, upstream API integration. 1.5-42x faster than lxml across the parse, XPath, and traversal benchmarks. 191 tests, pyright clean, ruff clean.
1 parent dd1ae1a commit 39508b5

8 files changed

Lines changed: 951 additions & 407 deletions

Cargo.toml

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 [package]
 name = "simdxml-python"
-version = "0.1.0"
+version = "0.2.0"
 edition = "2021"
 
 [lib]
@@ -9,5 +9,5 @@ crate-type = ["cdylib"]
 
 [dependencies]
 pyo3 = { version = "0.28", features = ["extension-module", "abi3-py39"] }
-simdxml = "0.1"
+simdxml = "0.2"
 self_cell = "1"

README.md

Lines changed: 52 additions & 21 deletions
@@ -62,6 +62,10 @@ elem.getprevious() # previous sibling or None
 elem.xpath(".//title") # context-node evaluation
 elem.xpath_text("author") # text extraction from context
 
+# Batch APIs (single FFI call, interned strings)
+root.child_tags() # -> list[str] of child tag names
+root.descendant_tags("item") # -> list[str] filtered by tag
+
 # Compiled XPath (like re.compile)
 expr = simdxml.compile("//title")
 expr.eval_text(doc) # -> list[str]
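
Note on the batch APIs added in this hunk: the sketch below is a pure-Python stand-in for what a batch tag listing with interned strings looks like from the caller's side. It uses the stdlib `xml.etree.ElementTree` and `sys.intern`, not simdxml itself. The helper names mirror the README's `child_tags()` and `descendant_tags()`, but their bodies are assumptions for illustration only and say nothing about how the Rust extension is implemented.

```python
import sys
import xml.etree.ElementTree as ET


def child_tags(elem):
    """Return the tag names of the direct children as interned strings.

    Hypothetical stand-in: one call returns plain strings, so the caller
    never touches per-child Element objects.
    """
    return [sys.intern(child.tag) for child in elem]


def descendant_tags(elem, tag=None):
    """Return tag names of all descendants, optionally filtered by tag.

    Element.iter() also yields the element itself, so it is skipped here.
    """
    return [sys.intern(d.tag) for d in elem.iter(tag) if d is not elem]


root = ET.fromstring(
    "<catalog><item><name>a</name></item><item><name>b</name></item><note/></catalog>"
)
tags = child_tags(root)
```

Because the strings are interned, equal tag names are the same object (`tags[0] is tags[1]` holds for the two `item` children), so a long list of repeated tags costs one string per distinct name plus a list of references.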
@@ -118,33 +122,59 @@ Full conformance with XPath 1.0:
 
 ## Benchmarks
 
-Measured on Apple Silicon (M-series), Python 3.14, comparing against
-lxml 6.0 and stdlib `xml.etree.ElementTree`. Run with `uv run python bench/bench_parse.py`.
+Apple Silicon, Python 3.14, lxml 6.0. GC disabled during timing, 3 warmup +
+20 timed iterations, median reported. Three corpus types: data-oriented
+(product catalog), document-oriented (PubMed abstracts), config-oriented
+(Maven POM). Run yourself: `uv run python bench/bench_parse.py`
+
+### Parse
+
+`simdxml.parse()` eagerly builds structural indices (CSR, name posting).
+lxml's `fromstring()` builds a DOM tree without precomputed query indices.
+simdxml front-loads more work into parse so queries are faster — both numbers
+are real, the trade-off depends on your workload.
+
+| Corpus | Size | simdxml | lxml | vs lxml | vs stdlib |
+|--------|------|---------|------|---------|-----------|
+| Catalog (data) | 1.6 MB | 2.7 ms | 8.1 ms | **3.0x** | **5.4x** |
+| Catalog (data) | 17 MB | 32 ms | 82 ms | **2.6x** | **4.7x** |
+| PubMed (doc) | 1.7 MB | 2.3 ms | 6.0 ms | **2.7x** | **5.9x** |
+| PubMed (doc) | 17 MB | 27 ms | 61 ms | **2.2x** | **5.0x** |
+| POM (config) | 2.1 MB | 2.7 ms | 8.3 ms | **3.1x** | **6.6x** |
+
+### XPath queries (returning Elements — apples-to-apples)
 
-### Parse throughput
+| Query | Corpus | simdxml | lxml | vs lxml |
+|-------|--------|---------|------|---------|
+| `//item` | Catalog 17 MB | 3.4 ms | 21 ms | **6x** |
+| `//item[@category="cat5"]` | Catalog 17 MB | 1.6 ms | 69 ms | **42x** |
+| `//PubmedArticle` | PubMed 17 MB | 0.35 ms | 9.8 ms | **28x** |
+| `//Author[LastName="Auth0_0"]` | PubMed 17 MB | 13 ms | 29 ms | **2.2x** |
+| `//dependency` | POM 2.1 MB | 0.34 ms | 1.1 ms | **3.3x** |
+| `//dependency[scope="test"]` | POM 2.1 MB | 2.4 ms | 3.6 ms | **1.5x** |
 
-| Document | simdxml | lxml | stdlib ET | vs lxml | vs stdlib |
-|----------|---------|------|-----------|---------|-----------|
-| 20 KB (100 items) | 0.05 ms | 0.09 ms | 0.15 ms | 1.8x | 3.0x |
-| 2 MB (10K items) | 3.3 ms | 8.5 ms | 16.7 ms | 2.6x | 5.0x |
-| 20 MB (100K items) | 40 ms | 87 ms | 181 ms | **2.2x** | **4.5x** |
+### XPath text extraction
 
-### XPath query: `//name`
+`xpath_text()` returns strings directly, avoiding Element object creation.
+This is the optimized path for ETL / data extraction workloads.
 
-| Document | simdxml | lxml | stdlib findall | vs lxml | vs stdlib |
-|----------|---------|------|----------------|---------|-----------|
-| 2 MB | 0.3 ms | 1.0 ms | 0.7 ms | 3.1x | 2.1x |
-| 20 MB | 3.8 ms | 19.7 ms | 7.3 ms | **5.2x** | **1.9x** |
+| Query | Corpus | simdxml | lxml xpath+.text | vs lxml |
+|-------|--------|---------|------------------|---------|
+| `//name` | Catalog 17 MB | 1.8 ms | 37 ms | **20x** |
+| `//AbstractText` | PubMed 17 MB | 0.31 ms | 7.1 ms | **23x** |
+| `//artifactId` | POM 2.1 MB | 0.21 ms | 2.0 ms | **10x** |
 
-### XPath query with predicate: `//item[@category="cat5"]`
+### Element traversal
 
-| Document | simdxml | lxml | stdlib findall | vs lxml |
-|----------|---------|------|----------------|---------|
-| 2 MB | 0.2 ms | 2.8 ms | 0.8 ms | 16x |
-| 20 MB | 2.0 ms | 46 ms | 9.1 ms | **23x** |
+`child_tags()` and `descendant_tags()` return all tag names in a single
+call using interned Python strings. Per-element iteration (`for e in root`)
+is also available but creates Element objects with some overhead.
 
-The predicate speedup is dramatic because simdxml's structural index enables
-direct attribute comparison without materializing DOM nodes.
+| Corpus | `child_tags()` | lxml `[e.tag]` | vs lxml |
+|--------|----------------|-----------------|---------|
+| Catalog 17 MB | **0.38 ms** | 6.4 ms | **17x** |
+| PubMed 17 MB | **0.03 ms** | 0.60 ms | **17x** |
+| POM 2.1 MB | **0.2 us** | 0.5 us | **3x** |
 
 ## How it works
 
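
Note on the benchmark methodology stated in the hunk above (GC disabled during timing, 3 warmup + 20 timed iterations, median reported): a minimal harness following that recipe could look like the sketch below. It is not the contents of `bench/bench_parse.py`, which this commit view does not show. The function name `median_ms`, the synthetic catalog document, and the use of stdlib `ElementTree` as the measured parser are all assumptions for illustration.

```python
import gc
import statistics
import time
import xml.etree.ElementTree as ET


def median_ms(fn, warmup=3, iterations=20):
    """Return the median wall-clock time of fn() in milliseconds.

    Warmup calls are untimed; the garbage collector is disabled only
    while the timed iterations run and restored afterwards.
    """
    for _ in range(warmup):
        fn()
    gc_was_enabled = gc.isenabled()
    gc.disable()
    try:
        samples = []
        for _ in range(iterations):
            start = time.perf_counter()
            fn()
            samples.append((time.perf_counter() - start) * 1000.0)
    finally:
        if gc_was_enabled:
            gc.enable()
    return statistics.median(samples)


# Synthetic stand-in corpus (assumption): a small product catalog.
xml_bytes = (
    b"<catalog>"
    + b"".join(
        b'<item category="cat%d"><name>n%d</name></item>' % (i % 10, i)
        for i in range(1000)
    )
    + b"</catalog>"
)
parse_ms = median_ms(lambda: ET.fromstring(xml_bytes))
```

Reporting the median rather than the mean keeps a single slow iteration (for example, one hit by a scheduler pause) from skewing the result.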
@@ -157,7 +187,8 @@ and parents -- all indexed by the same position.
 - O(1) ancestor/descendant checks via pre/post-order numbering
 - O(1) child enumeration via CSR (Compressed Sparse Row) indices
 - SIMD-accelerated structural parsing (NEON on ARM, AVX2 on x86)
-- Lazy index building: CSR indices built on first query, not at parse time
+- Parse eagerly builds all indices (CSR, name posting, parent map) so
+subsequent queries pay zero index construction cost
 
 ## Platform support
 
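
Note on the "How it works" bullets in the last hunk: the sketch below shows the two general techniques they name, a CSR child index and pre/post-order numbering, built eagerly from a parent array. It is a pure-Python illustration under stated assumptions, not simdxml's Rust implementation. `TreeIndex`, `build_index`, `children`, and `is_ancestor` are hypothetical names, and the input is assumed to be a single-rooted tree with nodes numbered in document order.

```python
from dataclasses import dataclass


@dataclass
class TreeIndex:
    offsets: list[int]  # CSR row pointers, length n + 1
    targets: list[int]  # child ids, grouped by parent
    pre: list[int]      # clock value when a node is entered
    post: list[int]     # clock value when a node is left


def build_index(parents: list[int]) -> TreeIndex:
    """parents[i] is the parent of node i, or -1 for the single root."""
    n = len(parents)
    offsets = [0] * (n + 1)
    for p in parents:
        if p >= 0:
            offsets[p + 1] += 1          # count children per parent
    for i in range(n):
        offsets[i + 1] += offsets[i]     # prefix sum -> row pointers
    targets = [0] * offsets[n]
    cursor = offsets[:-1]                # next free slot per parent (a copy)
    for child, p in enumerate(parents):
        if p >= 0:
            targets[cursor[p]] = child
            cursor[p] += 1

    # Iterative DFS with one shared clock for entry and exit times.
    pre, post, clock = [0] * n, [0] * n, 0
    root = parents.index(-1)
    pre[root] = clock
    clock += 1
    stack = [(root, offsets[root])]
    while stack:
        node, pos = stack[-1]
        if pos < offsets[node + 1]:
            stack[-1] = (node, pos + 1)
            child = targets[pos]
            pre[child] = clock
            clock += 1
            stack.append((child, offsets[child]))
        else:
            post[node] = clock
            clock += 1
            stack.pop()
    return TreeIndex(offsets, targets, pre, post)


def children(idx: TreeIndex, node: int) -> list[int]:
    """Two array lookups locate the child slice; no tree walk is needed."""
    return idx.targets[idx.offsets[node]:idx.offsets[node + 1]]


def is_ancestor(idx: TreeIndex, a: int, d: int) -> bool:
    """O(1) strict-ancestor test: a's interval must enclose d's."""
    return idx.pre[a] < idx.pre[d] and idx.post[d] < idx.post[a]


# catalog(0) -> item(1) -> name(2); catalog(0) -> item(3) -> name(4)
idx = build_index([-1, 0, 1, 0, 3])
```

In this example `children(idx, 0)` yields the two `item` nodes `[1, 3]`, and `is_ancestor(idx, 3, 4)` is true while `is_ancestor(idx, 1, 4)` is false, each decided by comparing two pairs of integers.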