
fix(chunking): preserve semantic headers in carried table chunks #4313

Open
cragwolfe wants to merge 10 commits into main from crag/preserve-headers-in-table-chunks

@cragwolfe cragwolfe commented Apr 2, 2026

Second-Round Problem Statement

The first round restored carried-header semantics (<thead>/<th>) for continuation chunks. Follow-up analysis found a reconstruction edge case: during merged-table reconstruction, canonical <thead> synthesis could still trigger when the carried header rows did not correspond to the chunk-0 leading rows. In those mismatched cases, reconstruction could over-normalize the header structure, risking row-order and semantic drift.

Boundary Rationale

Boundary decisions were intentionally kept narrow:

  1. Keep ordinary compactified table/body contracts unchanged (HtmlCell.html, HtmlRow.html, HtmlTable.html, header-count semantics).
  2. Preserve source row HTML only as additive state for carried-header semantics.
  3. Restrict canonical <thead> reconstruction to cases where carried rows match chunk-0 leading rows.
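
Decision 2 can be pictured as an optional field that defaults to empty, so existing consumers of the compactified HTML see no change. The following is a minimal sketch under that assumption; `RowSketch` and `header_source` are hypothetical names for illustration and are not the library's `HtmlRow` API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RowSketch:
    """Hypothetical stand-in for a table row; not the real HtmlRow."""

    html: str  # compactified row HTML (the existing contract, unchanged)
    source_html: Optional[str] = None  # additive: pre-compactification row HTML


def header_source(row: RowSketch) -> str:
    """Prefer the preserved source HTML, falling back to the compactified HTML."""
    return row.source_html if row.source_html is not None else row.html


# A row built the old way still behaves exactly as before.
plain = RowSketch(html="<tr><td>A</td></tr>")
# A row that carries source HTML exposes it only to header serialization.
rich = RowSketch(html="<tr><td>A</td></tr>", source_html='<tr><th scope="col">A</th></tr>')
```

Because the new field has a default, code that constructs rows without it keeps working, which is the sense in which the state is additive.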

Why this boundary:

  • Fixes semantic-preservation gaps without broadening compactification behavior.
  • Avoids blast radius in non-header body chunking paths.
  • Keeps reconstruction deterministic by gating on row-signature agreement.
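
The row-signature gate can be sketched as follows. This is an illustration only: it assumes rows are already parsed into lists of cell texts and that a signature is the whitespace-normalized cell text. The names `row_signature` and `carried_rows_match` are hypothetical, and the actual signature used in the PR may differ.

```python
from typing import Sequence, Tuple


def row_signature(cells: Sequence[str]) -> Tuple[str, ...]:
    """Hypothetical signature: each cell's text with whitespace collapsed."""
    return tuple(" ".join(cell.split()) for cell in cells)


def carried_rows_match(
    carried: Sequence[Sequence[str]],
    chunk0_rows: Sequence[Sequence[str]],
) -> bool:
    """True only when every carried row equals the chunk-0 row at the same position."""
    if not carried or len(carried) > len(chunk0_rows):
        return False  # nothing to compare, or more carried rows than chunk 0 has
    # zip stops at the shorter input, so only the leading rows of chunk 0 are checked.
    return all(
        row_signature(c) == row_signature(lead)
        for c, lead in zip(carried, chunk0_rows)
    )


headers = [["Header A", "Header B"], ["Subhead A", "Subhead B"]]
chunk0 = headers + [["Row 1A", "Row 1B"]]
```

With a gate of this shape, canonical <thead> synthesis would run only when `carried_rows_match(...)` is true; on a mismatch, reconstruction would leave the rows as they are.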

Implementation Summary

Follow-up commits on this branch (ba8241cf, eb39dacf, c33e6bce, 14b947c9) do the following:

  1. unstructured/common/html_table.py
  • Preserve pre-compactification row HTML (source_row_htmls) and thread it into HtmlRow(source_html=...) as additive metadata.
  2. unstructured/chunking/base.py
  • Update carried-header serialization to prefer row.source_html (fallback row.html).
  • Convert only direct-child <td> tags to <th> for carried headers, preserving attributes and nested subtree content.
  3. unstructured/chunking/dispatch.py
  • Add canonical <thead> synthesis guardrails so reconstruction uses carried rows only when they match chunk-0 leading-row signatures.
  • Prevent synthetic canonical headers when carried rows are mismatched.
  4. Regression coverage
  • Expanded test_unstructured/chunking/test_base.py for attribute preservation, non-text cell preservation, canonical <thead> gating, and carried-header stability.
  • Added test_unstructured/common/test_html_table.py coverage for preserved source-row HTML plumbing.
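
The direct-child <td> to <th> conversion described for base.py can be illustrated with a small standalone sketch. It uses the standard-library ElementTree parser rather than whatever the PR itself uses, assumes the row markup is well-formed, and `promote_direct_cells_to_th` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def promote_direct_cells_to_th(row_html: str) -> str:
    """Rename only the row's direct <td> children to <th>.

    Attributes and nested content (including nested tables) are untouched.
    Assumes `row_html` is a single well-formed <tr>...</tr> fragment.
    """
    row = ET.fromstring(row_html)
    for cell in row:  # iterating an element yields direct children only
        if cell.tag == "td":
            cell.tag = "th"
    return ET.tostring(row, encoding="unicode", method="html")


sample = (
    '<tr><td class="k" colspan="2">A'
    "<table><tr><td>inner</td></tr></table></td><th>B</th></tr>"
)
converted = promote_direct_cells_to_th(sample)
```

Renaming the tag in place, instead of rebuilding the cell, is what keeps attributes such as colspan and any nested subtree intact. The inner table's <td> stays a <td> because it is not a direct child of the carried row.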

Validation Evidence

Executed for the second-round follow-up:

# Harness init sanity suite
./init.sh

# Repo gates and focused/regression suites
make check
uv run pytest test_unstructured/chunking/test_base.py -k "preserves_header_semantics_on_carried_header_rows or preserves_source_header_row_html_for_carried_rows or preserves_non_text_only_carried_header_cells or keeps_compactified_contracts_for_non_header_body_cells or reconstructs_a_single_canonical_thead_for_carried_headers or preserves_header_attributes_in_reconstructed_canonical_thead"
uv run pytest test_unstructured/chunking/test_title.py::test_add_chunking_strategy_forwards_repeat_table_headers
uv run pytest test_unstructured/chunking/test_base.py -k "carried_over_header_rows or carried_header or canonical_thead or preserves_source_header_row_html or compactified_contracts"
uv run pytest test_unstructured/chunking/test_base.py
uv run pytest -q test_unstructured/common/test_html_table.py

# E2E carried-header smoke (inline, no local-only script dependency)
uv run - <<'PY'
from unstructured.chunking.dispatch import chunk_elements, reconstruct_table_from_chunks
from unstructured.documents.elements import ElementMetadata, Table

html = (
    "<table>"
    "<tr><th>Header A</th><th>Header B</th></tr>"
    "<tr><th>Subhead A</th><th>Subhead B</th></tr>"
    + "".join(f"<tr><td>Row {i}A</td><td>Row {i}B</td></tr>" for i in range(1, 8))
    + "</table>"
)
text = "\n".join(
    ["Header A Header B", "Subhead A Subhead B"]
    + [f"Row {i}A Row {i}B" for i in range(1, 8)]
)
chunks = chunk_elements(
    [Table(text=text, metadata=ElementMetadata(text_as_html=html))],
    max_characters=75,
    new_after_n_chars=75,
    overlap=0,
    overlap_all=False,
    repeat_table_headers=True,
)
assert len(chunks) == 4
assert [c.metadata.num_carried_over_header_rows for c in chunks] == [0, 2, 2, 2]
for chunk in chunks[1:]:
    assert "<th>Header A</th>" in chunk.metadata.text_as_html
    assert "<th>Subhead A</th>" in chunk.metadata.text_as_html
    assert chunk.text.startswith("Header A Header B Subhead A Subhead B ")

reconstructed = reconstruct_table_from_chunks(chunks)
assert len(reconstructed) == 1
assert reconstructed[0].metadata.text_as_html.count("<tr>") == 9
print("e2e-table-header-smoke: ok")
PY

Observed results:

  • ./init.sh: pass (7 passed, 6 passed)
  • make check: pass
  • carried-header focused pytest selection: pass (3 passed, 203 deselected)
  • E2E smoke: pass (e2e-table-header-smoke: ok)
  • additional targeted/full chunking and html-table tests listed above: pass

authored by codex

@cragwolfe cragwolfe marked this pull request as ready for review April 2, 2026 02:38