openspec: pdf-anonymisation-discovery research by rjzondervan · Pull Request #1591 · ConductionNL/openregister

rjzondervan · 2026-05-19T09:25:00Z

Summary

Discovery doc — explicitly NOT a change yet. Captures the design analysis for anonymising PDF inputs in OpenRegister, for team review before scaffolding a real change.
Documents the hard constraints, the eight approaches considered, and a recommended two-tier architecture.
Includes six open questions (Q1-Q6) the team should weigh in on before commitment.

What's in this PR

One file: openspec/changes/pdf-anonymisation-discovery/discovery.md (~600 lines).

Structure:

Purpose — why this is a discovery, not yet a change
Context — the current corrupted PDF anonymisation flow + sister changes in flight
Hard constraints — five non-negotiables (no PII in output, identifiable placeholders, layout intact, FOSS only, no sidecars)
Approaches — table of 8 paths (A-H) with kept/rejected reasoning
Recommended architecture — Path A (SAPP byte-replace with Helvetica fallback, variants matching, TJ flattening, validation gate) + Path B (NC Office API ODT round-trip) fallback
Detailed work breakdown — per-piece for both paths, including the multi-encoding stream filter support (FlateDecode, LZWDecode, ASCII85Decode, ASCIIHexDecode, RunLengthDecode + chaining)
Open questions Q1-Q6 — for team decision
Risks and mitigations
Recommended next step — 2-3 day empirical spike on ~10 real Woo PDFs before scaffolding
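
The TJ flattening step listed under the recommended architecture can be pictured as follows. A PDF `TJ` operator shows an array of string fragments interleaved with numeric kerning adjustments, so a name may be split across several fragments and a plain byte search would miss it. The sketch below is one assumed reading of "flattening", not the discovery doc's implementation: it merges the literal-string fragments of each `TJ` array into a single `Tj` string and drops the kerning values. The name `flatten_tj` is hypothetical; hex strings (`<...>`) and nested parentheses are deliberately not handled.

```python
import re

# Literal PDF string "(...)" with backslash escapes. Unescaped nested
# parentheses are out of scope for this sketch.
_LITERAL = rb"\((?:\\.|[^\\()])*\)"
# A number must be followed by whitespace, "(" or "]" so it cannot be
# split into several shorter numbers (keeps matching unambiguous).
_NUMBER = rb"[-+]?(?:\d+\.?\d*|\.\d+)(?=[\s(\]])"
# A TJ array made only of literal strings and numbers, then the operator.
_TJ_ARRAY = re.compile(
    rb"\[((?:\s*(?:" + _LITERAL + rb"|" + _NUMBER + rb"))*)\s*\]\s*TJ",
    re.DOTALL,
)
_LITERAL_RE = re.compile(_LITERAL, re.DOTALL)


def flatten_tj(content: bytes) -> bytes:
    """Rewrite `[(a) -20 (b)] TJ` as `(ab) Tj` (hypothetical helper).

    Kerning numbers are discarded, so glyph spacing changes slightly.
    Large negative adjustments that act as word gaps are also lost.
    String bytes stay in the font's own encoding; nothing is decoded.
    """

    def merge(match: re.Match) -> bytes:
        fragments = _LITERAL_RE.findall(match.group(1))
        joined = b"".join(fragment[1:-1] for fragment in fragments)
        return b"(" + joined + b") Tj"

    return _TJ_ARRAY.sub(merge, content)
```

After flattening, `[(Jan) -20 ( Jans) 15 (sen)] TJ` becomes `(Jan Janssen) Tj`, which a byte-level search can match in one piece. Arrays that are not followed by `TJ`, such as dash patterns, are left untouched.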

Why this isn't a change yet

The FOSS, PHP-native space for true PDF text replacement is narrow. Setasign's SetaPDF-Redactor exists commercially because nobody has filled that gap in OSS. The recommended path uses ddn/sapp (LGPL-3.0-or-later) as a structural substrate, with custom encoding, CMap, and replacement-injection layers that we would build on top.

That is roughly 4-5 weeks of engineering for Path A plus 1-2 weeks for Path B. It is worth doing if the team agrees on the trade-offs before committing. The discovery doc exists so we can have that conversation against a written artifact rather than ad hoc.

Composition

References:

Sister changes office-document-sanitization (#1589) and text-extraction-office-completeness (#1590): the Path B fallback reuses both.
Existing DocuDesk changes anonymise-output-as-pdf-by-default and ocr-document-scanning: used for Path B output and scanned-PDF preprocessing, respectively.

Status

For team review. After review and decisions on Q1-Q6:

Run the empirical spike (Q6) on real Woo PDFs.
Scaffold the actual change(s) — likely one for Path A and one for Path B, depending on Q1.

Test plan

Read end-to-end (~15-20 min)
Decisions on Q1 (two-tier vs single-tier)
Decisions on Q2 (placeholder format — keep [<TYPE>: <id>] or switch to subset-safe)
Decisions on Q3 (/XMP — sanitise in place or remove)
Decisions on Q4 (validation gate failure handling)
Test fixture sourcing plan (Q5)
Agreement on spike protocol (Q6)
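
To make Q2 and Q4 more concrete, here is a minimal sketch of what a validation gate could check, assuming the gate works on text extracted from the anonymised PDF. It reports any known PII term that survived and any expected placeholder that is missing. Everything here is hypothetical: the names `GateResult` and `validate_output`, the regex reading of the `[<TYPE>: <id>]` format, and the inputs. What to do when the gate fails is exactly the open question Q4, so the sketch only returns a result.

```python
import re
from dataclasses import dataclass, field

# Assumed reading of "[<TYPE>: <id>]": an upper-case type label and an
# alphanumeric id. The real grammar is not specified in this PR.
PLACEHOLDER_RE = re.compile(r"\[([A-Z_]+): ([A-Za-z0-9-]+)\]")


@dataclass
class GateResult:
    leaked_terms: list = field(default_factory=list)
    missing_placeholders: list = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.leaked_terms and not self.missing_placeholders


def _normalise(text: str) -> str:
    # Collapse whitespace and case so line wraps do not hide a match.
    return " ".join(text.split()).casefold()


def validate_output(extracted_text, pii_terms, expected_placeholders):
    """Check extracted output text for leaked PII and missing placeholders."""
    haystack = _normalise(extracted_text)
    leaked = [
        term for term in pii_terms
        if term.strip() and _normalise(term) in haystack
    ]
    found = {m.group(0) for m in PLACEHOLDER_RE.finditer(extracted_text)}
    missing = [p for p in expected_placeholders if p not in found]
    return GateResult(leaked_terms=leaked, missing_placeholders=missing)
```

A substring check like this only catches terms that survive in extractable text. It says nothing about PII left in metadata, images, or unreferenced objects, which would need separate checks.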

Discovery doc capturing the design analysis for anonymising PDF inputs in OpenRegister's DocumentProcessingHandler. Status: research, not yet a change. Exists to be reviewed by the team before scaffolding a real change, because the FOSS-PHP-native space for PDF text replacement is narrow enough to warrant collective agreement on trade-offs first. Documents the hard constraints (no PII leakage, identifiable placeholders, layout preservation, FOSS-only, no sidecars), the considered approaches (8 paths A-H), and a recommended two-tier architecture: Path A primary via ddn/sapp byte-replace with Helvetica fallback and multi-encoding stream filter support; Path B fallback via NC Office API ODT round-trip. Includes detailed work breakdown, open questions for team review (Q1-Q6), and a proposed empirical spike before commitment.
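
The multi-encoding stream filter support mentioned above can be sketched as a decoder chain. In a PDF stream dictionary, the `/Filter` array lists filters in the order they must be applied when decoding, so the chain is a left-to-right fold. This is an illustrative Python sketch under stated assumptions, not the planned PHP implementation: `decode_stream` and the helper names are hypothetical, `/DecodeParms` (for example PNG predictors) is ignored, and LZWDecode is left unimplemented because the Python standard library has no LZW decoder.

```python
import base64
import binascii
import zlib

_PDF_WHITESPACE = b" \t\r\n\f\x00"


def _ascii85_decode(data: bytes) -> bytes:
    body = data.split(b"~>", 1)[0].strip()   # "~>" is the end-of-data marker
    if body.startswith(b"<~"):               # optional leading marker
        body = body[2:]
    return base64.a85decode(body, ignorechars=_PDF_WHITESPACE + b"\v")


def _ascii_hex_decode(data: bytes) -> bytes:
    digits = data.split(b">", 1)[0].translate(None, _PDF_WHITESPACE)
    if len(digits) % 2:                       # odd final digit: pad with 0
        digits += b"0"
    return binascii.unhexlify(digits)


def _run_length_decode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:                     # end-of-data marker
            break
        if length < 128:                      # copy the next length+1 bytes
            out += data[i:i + length + 1]
            i += length + 1
        else:                                 # repeat next byte 257-length times
            out += data[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)


_DECODERS = {
    "FlateDecode": zlib.decompress,
    "ASCII85Decode": _ascii85_decode,
    "ASCIIHexDecode": _ascii_hex_decode,
    "RunLengthDecode": _run_length_decode,
}


def decode_stream(data: bytes, filters) -> bytes:
    """Apply each named filter in /Filter order (hypothetical helper)."""
    for name in filters:
        decoder = _DECODERS.get(name)
        if decoder is None:
            raise NotImplementedError(f"filter not supported in this sketch: {name}")
        data = decoder(data)
    return data
```

For a stream declared as `/Filter [/ASCII85Decode /FlateDecode]`, `decode_stream(raw, ["ASCII85Decode", "FlateDecode"])` undoes the ASCII85 layer first and then inflates the result. Re-encoding after replacement would run the inverse filters in reverse order.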

github-actions · 2026-05-19T09:31:38Z

Quality Report — ConductionNL/openregister @ `0492651`

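| Check | PHP | Vue | Security | License | Tests |
|---|---|---|---|---|---|
| lint | ✅ | | | | |
| phpcs | ✅ | | | | |
| phpmd | ✅ | | | | |
| psalm | ✅ | | | | |
| phpstan | ✅ | | | | |
| phpmetrics | ✅ | | | | |
| eslint | | ✅ | | | |
| stylelint | | ✅ | | | |
| composer | | | ✅ | ✅ 162/162 | |
| npm | | | ✅ | ✅ 602/602 | |
| PHPUnit | | | | | ✅ |
| Newman | | | | | ⏭️ |
| Playwright | | | | | ⏭️ |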

Quality workflow — 2026-05-19 09:31 UTC

Download the full PDF report from the workflow artifacts.

rjzondervan requested review from Rem-Dam and rubenvdlinde as code owners May 19, 2026 09:25

rjzondervan wants to merge 1 commit into development from feat/pdf-anonymisation-discovery
