Skip to content

Implement manual-entity-anonymisation: POST /api/files/{fileId}/manual-entities (#1593)#1592

Open
rjzondervan wants to merge 8 commits into
developmentfrom
feat/manual-entity-anonymisation
Open

Implement manual-entity-anonymisation: POST /api/files/{fileId}/manual-entities (#1593)#1592
rjzondervan wants to merge 8 commits into
developmentfrom
feat/manual-entity-anonymisation

Conversation

@rjzondervan
Copy link
Copy Markdown
Member

@rjzondervan rjzondervan commented May 19, 2026

Summary

Full implementation of the manual-entity-anonymisation change. Extends the entity-relation-grondslagen capability with a write path for operator-supplied entities:

  • New endpoint POST /api/files/{fileId}/manual-entities: operator types text + entity type, server matches against the file's TextExtractionService chunks, creates a GdprEntity catalogue entry (or reuses an existing one for the same (value, type) pair), and inserts one EntityRelation per occurrence — all atomically.
  • Idempotent: re-calling with the same value on the same file does not create duplicates (each match position is probed via EntityRelationMapper::existsForFileAtPosition before insert; already-present rows bump matchesSkipped).
  • Defaults: wholeWord = true, caseSensitive = true (operator means exactly what they typed).
  • Zero matches returns HTTP 200 with the catalogue entry still created + a message; operator intent is recorded in the audit trail regardless.
  • RBAC: caller must have file-write access (isUpdateable()) — same model as PATCH /api/entity-relations/{id}.
  • PII redaction (ADR-005): request log captures valueLength only, never value. Error responses never echo the operator-supplied text.
  • Audit trail (ADR-022): entity_create (only on new catalogue rows) + entity_relations_batch_create (every call, including zero-match).

What's in this PR

  • Scaffolding (already in this branch): openspec/changes/manual-entity-anonymisation/{proposal,design,tasks}.md + specs/entity-relation-grondslagen/spec.md.
  • Implementation:
    • lib/Service/File/ChunkTextMatcher.php — chunk-aware preg_match_all + absolute-position dedup (overlap-region: lowest chunkIndex wins). Rejects needles > chunk overlap; PII-safe error messages.
    • lib/Service/File/ManualEntityService.php — orchestrates the lookup-or-create + batch insert + audit-trail write inside one IDBConnection::beginTransaction() window. Catches and logs the root cause on rollback before re-wrapping as ManualEntityException.
    • lib/Service/File/ManualEntityResult.php — DTO {entity, entityWasNew, relations, matchCount, matchesSkipped}.
    • lib/Exception/{ChunkMatcherException,ManualEntityException}.php — typed reason codes (value_too_long, regex_compile_failure, file_not_extracted, unsupported_entity_type, internal_error).
    • lib/Db/DetectionMethod.php — constants (PRESIDIO / OPENANONYMISER / PATTERN / MANUAL).
    • lib/Db/GdprEntityMapper::findOneByValueAndType(value, type) lookup with dedup-invariant warning + first-row fallback.
    • lib/Db/EntityRelationMapper::existsForFileAtPosition + ::insertBatch — idempotency probe + caller-managed-transaction batch insert.
    • lib/Controller/FileTextController::addManualEntity — content-type guard, body validation, exception → status-code mapping, PII-redacted log.
    • appinfo/routes.phpPOST /api/files/{fileId}/manual-entities.
  • Server-side category derivation: the endpoint does NOT accept a category field; the column is NOT NULL and is populated from type via EntityRecognitionHandler::getCategoryForType() (lifted to public static for shared use), matching the detector flow. This was a deliberate UX simplification + bug-class elimination (the operator UI rarely has the context to set category, and forgetting to set it was the original source of HTTP 500s during testing).
  • Tests: 36 new unit tests (matcher / service / mapper / controller) — covering happy/reuse/idempotent/zero-match/file-not-extracted/RBAC/audit-failure-rollback paths + the ADR-005 PII redaction assertion on the request log.
  • Docs: docs/Features/manual-entity-anonymisation.md + CHANGELOG entry under ### Added.

Cross-change dependency

This branch is based on development directly. It does NOT include the entity-relation-grondslagen work (PR #1503) — that's tracked separately. The two changes share the EntityRelation table:

Verification

  • openspec validate manual-entity-anonymisation — clean.
  • PHPCS clean on all 10 new + touched files (legacy FileTextControllerTest.php + appinfo/routes.php PHPCS noise is pre-existing on development and untouched here).
  • PHPStan clean on the full diff.
  • 102 unit tests / 309 assertions all green in the live NC container (vendor/bin/phpunit --filter 'ChunkTextMatcherTest|ManualEntityServiceTest|FileTextController|GdprEntityMapperTest|EntityRelationMapperTest').
  • End-to-end smoke test against the dev stack:
    • Empty match (value not in file) → HTTP 200 with message; catalogue row created.
    • Match (value: "Robert" on a meeting-notes file) → HTTP 201, matchCount: 4, 4 relations created.
    • Idempotent retry → HTTP 201, matchCount: 4, matchesSkipped: 4, relations: [] (no duplicates).

Test plan

Scaffolds an openspec change that extends the entity-relation-grondslagen
capability with a write path for operator-supplied entities. New endpoint
POST /api/files/{fileId}/manual-entities accepts operator-typed text +
type, performs server-side chunk-aware exact-string matching against the
file's TextExtractionService chunks, and atomically creates a catalogue
entry (or reuses an existing one for the same value+type pair) plus one
EntityRelation row per occurrence found. Idempotent on re-call. Defaults:
whole-word + case-sensitive matching. Zero-match returns 200 with the
catalogue entry created and a message. RBAC: file-write access required.

Matcher is overlap-aware: per-chunk preg_match_all with absolute-position
deduplication (chunk_overlap default 200 chars; needles > overlap are
rejected). Manual entities end up smarter than current Presidio detection
(which has the same overlap-double-count behaviour); flagged as future
work in proposal Out of scope ("presidio-overlap-dedup").

Pure openspec - proposal, design (10 decisions), capability spec delta
(6 ADDED Requirements), tasks (10 sections). No implementation in this
PR; awaiting team review before /opsx:apply.
@github-actions
Copy link
Copy Markdown
Contributor

Quality Report — ConductionNL/openregister @ 7194ad8

Check PHP Vue Security License Tests
lint
phpcs
phpmd
psalm
phpstan
phpmetrics
eslint
stylelint
composer ✅ 162/162
npm ✅ 602/602
PHPUnit
Newman ⏭️
Playwright ⏭️

Quality workflow — 2026-05-19 13:10 UTC

Download the full PDF report from the workflow artifacts.

…+ service + endpoint (#1593)

Closes sections 1-4 of the manual-entity-anonymisation tasks.md.
Adds the operator-facing `POST /api/files/{fileId}/manual-entities`
endpoint and the persistence layer it sits on. The existing anonymise
flow picks up the new relation rows unchanged.

**Value objects + constants (section 1):**
- `lib/Service/File/ChunkTextMatcher.php` — pure utility, per-chunk
  preg_match_all + absolute-position dedup per design §D2. Lowest
  chunkIndex wins on ties so re-runs select the same canonical chunk
  (idempotency precondition). Rejects needles longer than the chunk
  overlap with a typed exception; PII-redacted error messages.
- `lib/Exception/ChunkMatcherException.php` — typed reasons
  (`value_too_long`, `regex_compile_failure`). Extends `\Exception`
  directly so generic catches in the controller don't absorb it.
- `lib/Exception/ManualEntityException.php` — orchestration-layer
  exception with typed reasons (`file_not_extracted`,
  `unsupported_entity_type`, `regex_compile_failure`, `internal_error`).
  Same direct-`\Exception` rationale.
- `lib/Db/DetectionMethod.php` — `PRESIDIO` / `OPENANONYMISER` /
  `PATTERN` / `MANUAL` constants. Final class, no instantiation.

**Mapper extensions (section 2):**
- `GdprEntityMapper::findOneByValueAndType` — `LIMIT 2` probe; logs
  a warning on dedup-invariant violation (more than one row for the
  same `(value, type)` pair) and returns the first; never throws.
  Constructor gains a nullable `LoggerInterface` so legacy direct
  callers still work.
- `EntityRelationMapper::existsForFileAtPosition` — single-row probe
  on the full `(fileId, entityId, chunkId, positionStart, positionEnd)`
  tuple. Backs the idempotency contract on retries.
- `EntityRelationMapper::insertBatch` — caller-managed-transaction
  batch insert. Validates required fields via setter type-hints, lets
  DB errors propagate so the caller can roll back. Verified: the
  `openregister_entity_relations` table has no `uuid` column (per the
  create-migration); row payload + response omit `uuid`.

**Orchestrator (section 3):**
- `lib/Service/File/ManualEntityService.php` — single point of truth
  for the file-scoped operator-add operation. Lookup-or-create
  catalogue entry, run the matcher, probe for existing relations,
  batch-insert new ones, write audit trails — all inside one
  `IDBConnection::beginTransaction()` / `commit()` / `rollBack()`
  window. Translates `ChunkMatcherException` to
  `ManualEntityException(REASON_REGEX_COMPILE_FAILURE)`.
- `lib/Service/File/ManualEntityResult.php` — immutable DTO carrying
  `{entity, entityWasNew, relations[], matchCount, matchesSkipped}`.
- `assertFileWriteAccess` — inlined user-folder lookup + `isUpdateable()`
  check (mirrors `EntityRelationsController::canWriteFile`). Throws
  `ManualEntityException` with a `forbidden:` message prefix so the
  controller maps it to 403 instead of the default 500 for
  `REASON_INTERNAL_ERROR`.
- Two audit-trail actions per the spec: `entity_create` (only when a
  new catalogue row is inserted) and `entity_relations_batch_create`
  (every call). ADR-022 forensic-exception: the operator-supplied
  `value` is allowed in the audit changed payload, never in HTTP
  logs / error responses.

**Controller + route (section 4):**
- `FileTextController::addManualEntity(int $fileId): JSONResponse` —
  content-type guard (415 on non-JSON), input validation (400 on
  missing `value` / `type`), session check (401), service dispatch,
  exception translation, response formatting.
- Response shape per the proposal: 201 with `entity` + `relations` +
  `matchCount` + `matchesSkipped` on ≥1 match, 200 with the same
  shape plus `message` on zero matches.
- `mapManualEntityException` translates `forbidden:`-prefixed
  REASON_INTERNAL_ERROR → 403, REASON_FILE_NOT_EXTRACTED → 422,
  REASON_REGEX_COMPILE_FAILURE / REASON_UNSUPPORTED_ENTITY_TYPE → 400,
  otherwise 500.
- PII-redacted request log: `valueLength` only, never `value`.
  Actor UID is logged (ADR-005 allows UID).
- Route registered at `POST /api/files/{fileId}/manual-entities`.

**Quality:**
- phpcs clean (4 line-length warnings, no errors) on all 10 touched
  files.
- phpstan clean.
- psalm clean.
- Tests / docs / CHANGELOG follow in next commits (tasks 5-9).

Status tracked in
`openspec/changes/manual-entity-anonymisation/tasks.md`
(boxes 1.1, 1.2, 1.3, 2.1, 2.2, 2.3, 3.1, 3.2, 3.3, 3.4, 3.5, 4.1,
4.2, 4.3 checked) and `plan.json` (32 sub-tasks, this commit covers
the 14 implementation items).
… controller (#1593)

Covers tasks 7.1–7.3 from the manual-entity-anonymisation tasks.md.

**New test files:**
- `tests/Unit/Service/File/ChunkTextMatcherTest.php` (13 tests)
  Single-chunk match, multi-match, overlap-region dedup (lowest
  chunkIndex wins), distinct cross-chunk matches, deterministic
  re-runs, whole-word / case-sensitive defaults + flips, zero-match
  empty array, empty-chunk-list empty array, value-too-long throws
  with no PII in message, empty needle returns empty.
- `tests/Unit/Service/File/ManualEntityServiceTest.php` (8 tests)
  Happy path (new entity), reuse path, idempotent retry (all skip),
  zero-match still records, file-not-extracted aborts before any
  transaction, read-only file → forbidden, missing-file-for-actor,
  audit-failure → rollBack + wrap as REASON_INTERNAL_ERROR.
- `tests/Unit/Controller/FileTextControllerManualEntityTest.php`
  (11 tests) — own sibling file matching the `Coverage` / `Deep` /
  `Gap` naming pattern. 415, 401, 400 (missing value / type), 403
  (forbidden:-prefixed), 422 (file_not_extracted), 400
  (regex_compile_failure), 500 (unknown throwable), 201 (success),
  200 (zero match), and a PII-redaction spy on the request log
  asserting the operator-supplied `value` never reaches the log
  record.

**Pre-existing fixes folded in (per the project rule on encountered
quality issues):**
- `FileTextController` constructor gained two new required params
  (`ManualEntityService`, `IUserSession`); the three existing test
  classes — `FileTextControllerTest`, `FileTextControllerCoverageTest`,
  `FileTextControllerDeepTest` — were updated to construct the SUT
  with all 10 args.
- Three tests in `FileTextControllerTest` (`testAnonymizeFileSuccess`,
  `testAnonymizeFileDeduplicatesEntities`,
  `testAnonymizeFileExceptionDuringAnonymization`) plus one in
  `FileTextControllerDeepTest` mocked the long-removed
  `findEntitiesForFile` method; renamed to the current
  `findEntitiesForAnonymization`. These tests were silently producing
  HTTP-400 "no entities" responses instead of exercising the success
  / exception branches they claim to cover.

**Verification:** 90 tests / 276 assertions all green in the live
`master-nextcloud-1` container (`docker exec ... vendor/bin/phpunit
--filter 'ChunkTextMatcherTest|ManualEntityServiceTest|FileTextController'`).
The local fallback bootstrap (no NC) runs the pure unit subset and
shows the same set of passes; the container run is the authoritative
gate.

**Note on PHPUnit named-args:** the Nextcloud `Entity::__call` magic
does NOT translate named arguments to its `$args` positional array,
so entity setter calls in tests MUST use positional arguments
(`$relation->setPositionStart(13)`, not `setPositionStart(value: 13)`).
PHPCS Conduction-rule's named-parameter sniff doesn't fire on these
because `setPositionStart` isn't a real declared method.

tasks.md updated to mark 7.1–7.3 done with the relevant implementation
notes.
…ueAndType + insertBatch + existsForFileAtPosition (#1593)

Covers tasks 5.1, 5.2, 7.4, 7.5 from the manual-entity-anonymisation
tasks.md.

**New tests (12 cases, 34 assertions):**
- `tests/Unit/Db/GdprEntityMapperTest.php` (4 cases) — existing
  matching row returned, null when no row, dedup-invariant
  violation logs a warning + returns the first row (with ADR-005
  PII redaction asserted on the log payload), no-logger path
  silently safe.
- `tests/Unit/Db/EntityRelationMapperTest.php` (8 cases) — adds
  to the pre-existing single construction-wiring test:
  insertBatch (iterates rows in order, empty rows are a no-op,
  propagates exceptions to the caller, defaults `createdAt`,
  keeps explicit `createdAt`), existsForFileAtPosition (true on
  matching row, false on empty fetch).

**Production change (minimal, behaviour-preserving):**
- `lib/Db/GdprEntityMapper::findOneByValueAndType` now calls
  `$this->findEntities(...)` instead of `parent::findEntities(...)`.
  Identical at runtime (the mapper does not override findEntities)
  but allows a test subclass to inject row sets without booting
  the QBMapper SQL chain. Same testability adjustment is
  unnecessary for `EntityRelationMapper` because the new methods
  use `$this->insert(...)` (already overridable) and a directly-
  mockable IQueryBuilder chain.

**Approach notes (kept in tasks.md):**
- Tasks 5.1 and 5.2 are marked done via code inspection — the
  relevant SELECTs (lib/Db/EntityRelationMapper.php:327, and the
  `findEntityIdsByValueForFile` companion) do not reference
  `detection_method` at all, so manual-method rows are picked up
  automatically. A unit assertion of "WHERE clause omits a column"
  would be brittle; the contract is better tested via the
  integration test (7.6, deferred).
- Per the pre-existing EntityRelationMapperTest convention,
  "DB-heavy query behaviour is covered by integration tests" —
  these new tests target the row → entity field mapping +
  return-value contract, not SQL semantics.

**Verification:** 102 tests / 310 assertions in the live
`master-nextcloud-1` container across the change's full surface
(`docker exec ... vendor/bin/phpunit --filter
'ChunkTextMatcherTest|ManualEntityServiceTest|FileTextController|GdprEntityMapperTest|EntityRelationMapperTest'`).
…, PHPCS cleanup on new tests (#1593)

Covers tasks 5.1, 5.2, 8.1, 8.2, and the 9.4 named-parameter sweep
from the manual-entity-anonymisation tasks.md.

**Docs (8.1, 8.2):**
- New `docs/Features/manual-entity-anonymisation.md` — sibling to
  `entity-relation-decision-metadata.md`. Covers endpoint contract,
  semantics (atomicity, idempotency, lookup-or-create, match flag
  defaults), audit-trail behaviour (`entity_create` +
  `entity_relations_batch_create`), anonymise-flow interaction
  (no `detection_method` filter), the value-keyed substitution
  caveat, PII redaction (ADR-005 + ADR-022 forensic exception),
  and the RBAC contract.
- CHANGELOG entry under `## Unreleased → ### Added` with the
  `#1593` issue link.

**PHPCS sweep on new tests (9.4):**
- All 5 new test files (`ChunkTextMatcherTest`,
  `ManualEntityServiceTest`, `FileTextControllerManualEntityTest`,
  `GdprEntityMapperTest`, `EntityRelationMapperTest`) are now
  PHPCS-clean: phpcbf-driven blank-line normalisation, member-var
  docblocks added, `$this->exactly(N)` and `$this->createMock(C::class)`
  converted to named-args, two inline-comment capitalisations
  fixed, the `EntityRelation` instanceof guard added to the
  test-mapper's `insert` override to satisfy PHPStan's array-type
  refinement.
- PHPStan clean on the full diff (new + modified production
  files + new + modified tests).
- All 9 new production files (Service/File/*, Exception/*,
  Db/DetectionMethod, FileTextController, GdprEntityMapper,
  EntityRelationMapper) remain PHPCS-clean (no new violations
  introduced by this commit).
- Out of scope: pre-existing PHPCS noise in legacy
  `FileTextControllerTest` / `FileTextControllerCoverageTest` /
  `FileTextControllerDeepTest` (~270 errors apiece, predates
  this change). Touched only the constructor wiring and the one
  stale `findEntitiesForFile` → `findEntitiesForAnonymization`
  rename per the previous commit; phpcbf-fixed what it could
  but did not chase the legacy backlog.
- Out of scope: `appinfo/routes.php` (583 pre-existing errors,
  one 3-line addition by this change followed the existing file
  pattern).

**Spec status (5.1, 5.2, 6.1):**
- 5.1 + 5.2 marked done — verified by code inspection that
  `findEntitiesForAnonymization` and `findEntityIdsByValueForFile`
  do not filter on `detection_method`, so manual-method rows
  flow through unchanged. Integration-level confirmation is
  deferred to task 7.6.
- 6.1 (update canonical
  `openspec/specs/entity-relation-grondslagen/spec.md`) is a
  no-op until the `entity-relation-grondslagen` change archives
  (no canonical capability spec exists yet — the spec lives in
  the in-flight delta at
  `openspec/changes/entity-relation-grondslagen/specs/entity-relation-grondslagen/spec.md`).
  Marked deferred in tasks.md.

**Verification:** `openspec validate manual-entity-anonymisation`
clean. PHPStan clean on all touched files. PHPUnit: 102 tests /
310 assertions all green in the `master-nextcloud-1` container.
)

Marks 6.1, 9.1, 9.2, 9.4, 10.1, 10.3 with their final status +
implementation notes. Remaining open items:

- 7.6 (integration test against stacked OR) — deferred, needs
  running stack + chunk-fixture wiring.
- 9.3 (manual smoke test) — operator-driven, out of scope.
- 10.2 (DocuDesk tracking issue) — pending user confirmation
  before filing.

29 of 32 sub-tasks are now done; the remaining three are
explicitly scoped out of this branch (integration test + operator
smoke + cross-app issue filing).
#1593)

Closes task 10.2. Filed
[ConductionNL/docudesk#225](ConductionNL/docudesk#225)
— `[OpenSpec] [docudesk] manual-entity-anonymisation-ui (PoC)`.

PoC framing is explicit in the title + summary + task list ("stock
NC components, no custom styling, no new deps") so the frontend
team uses the work as a reference, not as a production
implementation.

30 of 32 sub-tasks done. Remaining 2 are out of scope for this
branch: 7.6 (integration test, needs a stacked OR run) and 9.3
(operator-driven manual smoke).
…-side; surface root cause on rollback (#1593)

**Root cause of the operator-reported 500:**
The `oc_openregister_entities` table has TWO `NOT NULL`-with-no-default
columns (`category` and `updated_at`) that the existing detector flow
(`EntityRecognitionHandler::findOrCreateEntity`) always populates. The
new manual-entity flow only populated `detected_at`, so the insert
failed at the DB layer with two consecutive `SQLSTATE[HY000] 1364
Field 'X' doesn't have a default value` errors — one per column.

**Fix #1 — `category` is now server-derived, removed from the API:**
- `EntityRecognitionHandler::getCategoryForType()` is now
  `public static` (was `private`) so producers outside the detector
  pipeline can derive the same mapping without taking on the
  handler's DI graph.
- `ManualEntityService::lookupOrCreateEntity()` calls it on every new
  insert. Rationale: the operator UI rarely has the context to make a
  category decision, the detector flow already derives consistently,
  and exposing the field as operator-input was the original source of
  the bug (omitted = null = NOT NULL violation). Manual-entity rows
  and detector rows now share categories for identical types.
- `category` removed from the request body, the service signature, the
  `writeAuditTrails` signature, the controller body-parsing, and the
  unit tests. The audit trail still records `category` in
  `changed.fields` — read from the entity after insert, so the
  persisted value is captured verbatim.
- Operator-override on category is intentionally NOT exposed in v1; a
  follow-up if a concrete use case emerges.

**Fix #2 — `updated_at` populated alongside `detected_at`:**
- Same root cause class. Mirroring the detector flow's column
  population fixes it.

**Fix #3 — operator-debuggable rollback path:**
The service's `catch (Throwable $error)` block wrapped the original
exception as `ManualEntityException(REASON_INTERNAL_ERROR)` without
logging. Combined with the controller's `mapManualEntityException`
only logging the reason code (not the previous-exception message), a
500 left no debuggable trail in `nextcloud.log` — exactly what the
operator hit. Added a `LoggerInterface::error()` call inside the
service's rollback handler that captures `errorType`, `error`, and
`trace` of the underlying Throwable before re-wrapping. Future
operator-reported 500s will have a stack-trace-level log entry
keyed by reqId.

**Spec + docs updates:**
- `docs/Features/manual-entity-anonymisation.md` — request body shape
  loses the `category` line; added a note explaining the server-side
  derivation.
- `openspec/changes/manual-entity-anonymisation/proposal.md` — example
  body updated; the 400 error hint updated.
- `openspec/changes/manual-entity-anonymisation/design.md` — derivation
  rationale folded into the existing "lookup-or-create" section; the
  open-question on category resolved with a "not in the API at all"
  decision.
- `openspec/changes/manual-entity-anonymisation/specs/entity-relation-grondslagen/spec.md`
  — Requirement updated, "Category mismatch does NOT trigger an
  update" scenario replaced with "New catalogue row carries the
  server-derived category".
- `openspec validate manual-entity-anonymisation` clean.

**Verification:**
- PHPCS clean (4 touched files including the new logger call).
- PHPStan clean on the full diff.
- 44 unit tests / 145 assertions across
  `ManualEntityServiceTest|FileTextControllerManualEntityTest|ChunkTextMatcherTest|GdprEntityMapperTest|EntityRelationMapperTest`
  all green in the live `master-nextcloud-1` container.
- End-to-end smoke test in the dev environment:
  - Empty-match POST → HTTP 200 with `message`, catalogue row created.
  - Match POST → HTTP 201 with `matchCount=4, matchesSkipped=2` (the
    2 skipped were pre-existing detector hits at the same positions).
  - Idempotent re-POST → HTTP 201 with `matchCount=4, matchesSkipped=4`,
    `relations: []` (no duplicates).

Refs: #1593
@rjzondervan rjzondervan changed the title openspec: manual-entity-anonymisation scaffolding Implement manual-entity-anonymisation: POST /api/files/{fileId}/manual-entities (#1593) May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant