openspec: pdf-anonymisation-discovery research #1591

Open: rjzondervan wants to merge 1 commit into development from feat/pdf-anonymisation-discovery

@rjzondervan
Member

Summary

  • Discovery doc — explicitly NOT a change yet. Captures the design analysis for anonymising PDF inputs in OpenRegister, for team review before scaffolding a real change.
  • Documents the hard constraints, the eight approaches considered, and a recommended two-tier architecture.
  • Includes six open questions (Q1-Q6) the team should weigh in on before commitment.

What's in this PR

One file: openspec/changes/pdf-anonymisation-discovery/discovery.md (~600 lines).

Structure:

  • Purpose — why this is a discovery, not yet a change
  • Context — the current corrupted PDF anonymisation flow + sister changes in flight
  • Hard constraints — five non-negotiables (no PII in output, identifiable placeholders, layout intact, FOSS only, no sidecars)
  • Approaches — table of 8 paths (A-H) with kept/rejected reasoning
  • Recommended architecture — Path A (SAPP byte-replace with Helvetica fallback, variants matching, TJ flattening, validation gate) + Path B (NC Office API ODT round-trip) fallback
  • Detailed work breakdown — per-piece for both paths, including the multi-encoding stream filter support (FlateDecode, LZWDecode, ASCII85Decode, ASCIIHexDecode, RunLengthDecode + chaining)
  • Open questions Q1-Q6 — for team decision
  • Risks and mitigations
  • Recommended next step — 2-3 day empirical spike on ~10 real Woo PDFs before scaffolding
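
To make the multi-encoding stream filter item concrete, here is a minimal sketch of decoding a chained `/Filter` array. It is written in Python for brevity and is not the proposed PHP implementation; the names `decode_stream` and `DECODERS` are hypothetical. It assumes no `/DecodeParms` (no predictors) and leaves LZWDecode out, so it will raise on that filter.

```python
import base64
import binascii
import zlib


def run_length_decode(data: bytes) -> bytes:
    """Decode PDF RunLengthDecode data."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:      # end-of-data marker
            break
        if length < 128:       # copy the next length+1 bytes literally
            out += data[i:i + length + 1]
            i += length + 1
        else:                  # repeat the next byte 257-length times
            out += data[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)


def ascii_hex_decode(data: bytes) -> bytes:
    """Decode ASCIIHexDecode: whitespace is ignored, '>' ends the data."""
    digits = b"".join(data.split(b">", 1)[0].split())
    if len(digits) % 2:        # an odd trailing digit is padded with 0
        digits += b"0"
    return binascii.unhexlify(digits)


def ascii85_decode(data: bytes) -> bytes:
    """Decode ASCII85Decode: data ends with '~>' and has no '<~' prefix."""
    return base64.a85decode(data.split(b"~>", 1)[0])


DECODERS = {
    "FlateDecode": zlib.decompress,
    "ASCIIHexDecode": ascii_hex_decode,
    "ASCII85Decode": ascii85_decode,
    "RunLengthDecode": run_length_decode,
}


def decode_stream(raw: bytes, filters: list[str]) -> bytes:
    """Apply the filters in the order they are listed in /Filter."""
    for name in filters:
        if name not in DECODERS:
            raise NotImplementedError(f"unsupported filter: {name}")
        raw = DECODERS[name](raw)
    return raw
```

Re-encoding after replacement would apply the matching encoders in reverse order, which is part of why chaining support is listed as its own piece of work.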

Why this isn't a change yet

The FOSS-PHP-native space for true PDF text replacement is narrow. Setasign's SetaPDF-Redactor exists commercially because nobody has filled that gap in OSS. The recommended path uses ddn/sapp (LGPL-3.0-or-later) as a structural substrate, with custom encoding, CMap, and replacement-injection layers that we would build on top.

That's ~4-5 weeks of engineering for Path A plus ~1-2 weeks for Path B. It is worth doing if the team agrees on the trade-offs before committing. The discovery doc exists so we can have that conversation against a written artifact rather than ad hoc.
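
As one illustration of why the custom layers are non-trivial, here is a rough sketch of the TJ flattening step from Path A: a `TJ` array splits a name across several kerned string fragments, so a byte-level search for the name fails until the fragments are joined into a single `Tj` string. This is a Python sketch under stated assumptions, not the planned implementation. The function name `flatten_tj_arrays` is hypothetical. It only handles literal `(...)` strings without nested parentheses, skips arrays containing hex `<...>` strings, discards kerning offsets, and does not guard against a short octal escape at the end of one fragment merging with a digit at the start of the next.

```python
import re

# One literal string operand: "(" ... ")" with backslash escapes and
# no nested, unescaped parentheses.
_STRING = rb"\((?:\\.|[^\\()])*\)"
_LITERAL = re.compile(_STRING, re.DOTALL)

# A TJ array made only of literal strings, numbers and whitespace.
# "<" is excluded so arrays with hex strings are left untouched.
_TJ_ARRAY = re.compile(
    rb"\[((?:" + _STRING + rb"|[^()\[\]<])*)\]\s*TJ\b", re.DOTALL
)


def flatten_tj_arrays(content: bytes) -> bytes:
    """Rewrite each simple TJ array as a single-string Tj operation."""

    def _join(match: re.Match) -> bytes:
        fragments = _LITERAL.findall(match.group(1))
        # Strip the outer parentheses and concatenate the raw bodies;
        # escape sequences inside each fragment are kept as-is.
        joined = b"".join(fragment[1:-1] for fragment in fragments)
        return b"(" + joined + b") Tj"

    return _TJ_ARRAY.sub(_join, content)
```

For example, `BT [(Jan) -15 ( Jan) 20 (sen)] TJ ET` becomes `BT (Jan Jansen) Tj ET`. Dropping the kerning offsets shifts glyph positions slightly, which is one of the layout trade-offs the spike would need to measure.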

Composition

References:

Status

For team review. After review and decisions on Q1-Q6:

  1. Run the empirical spike (Q6) on real Woo PDFs.
  2. Scaffold the actual change(s) — likely one for Path A and one for Path B, depending on Q1.

Test plan

  • Read end-to-end (~15-20 min)
  • Decisions on Q1 (two-tier vs single-tier)
  • Decisions on Q2 (placeholder format — keep [<TYPE>: <id>] or switch to subset-safe)
  • Decisions on Q3 (/XMP — sanitise in place or remove)
  • Decisions on Q4 (validation gate failure handling)
  • Test fixture sourcing plan (Q5)
  • Agreement on spike protocol (Q6)
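
To help frame Q2 and Q4, here is a small sketch of the current `[<TYPE>: <id>]` placeholder format and of one possible reading of the validation gate: after replacement, re-extract the output text and fail if any known PII string is still present. The `Entity` class, the `PERSON` type, and the normalisation rule (case-folding plus whitespace collapsing) are assumptions for illustration, not decisions recorded in the discovery doc. Python is used for brevity.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    type: str   # hypothetical entity type, e.g. "PERSON"
    id: str     # stable identifier used in the placeholder
    text: str   # the original PII string


def placeholder(entity: Entity) -> str:
    """Render the placeholder in the [<TYPE>: <id>] format."""
    return f"[{entity.type}: {entity.id}]"


def _normalise(text: str) -> str:
    """Collapse whitespace and fold case so line-wrapped variants match."""
    return re.sub(r"\s+", " ", text).casefold().strip()


def validation_gate(output_text: str, entities: list[Entity]) -> list[Entity]:
    """Return the entities whose original text still appears in the output."""
    haystack = _normalise(output_text)
    return [e for e in entities if _normalise(e.text) in haystack]
```

An empty list from `validation_gate` would mean the output passes; what to do with a non-empty list (reject, fall back to Path B, or flag for manual review) is exactly the Q4 decision.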

Discovery doc capturing the design analysis for anonymising PDF inputs in
OpenRegister's DocumentProcessingHandler. Status: research, not yet a
change. Exists to be reviewed by the team before scaffolding a real
change, because the FOSS-PHP-native space for PDF text replacement is
narrow enough to warrant collective agreement on trade-offs first.

Documents the hard constraints (no PII leakage, identifiable
placeholders, layout preservation, FOSS-only, no sidecars), the
considered approaches (8 paths A-H), and a recommended two-tier
architecture: Path A primary via ddn/sapp byte-replace with Helvetica
fallback and multi-encoding stream filter support; Path B fallback via
NC Office API ODT round-trip. Includes detailed work breakdown, open
questions for team review (Q1-Q6), and a proposed empirical spike
before commitment.
@github-actions
Contributor

Quality Report — ConductionNL/openregister @ 0492651

Check PHP Vue Security License Tests
lint
phpcs
phpmd
psalm
phpstan
phpmetrics
eslint
stylelint
composer ✅ 162/162
npm ✅ 602/602
PHPUnit
Newman ⏭️
Playwright ⏭️

Quality workflow — 2026-05-19 09:31 UTC

Download the full PDF report from the workflow artifacts.
