Implement manual-entity-anonymisation: POST /api/files/{fileId}/manual-entities (#1593)#1592
Open
rjzondervan wants to merge 8 commits into
Open
Implement manual-entity-anonymisation: POST /api/files/{fileId}/manual-entities (#1593)#1592rjzondervan wants to merge 8 commits into
rjzondervan wants to merge 8 commits into
Conversation
Scaffolds an openspec change that extends the entity-relation-grondslagen
capability with a write path for operator-supplied entities. New endpoint
POST /api/files/{fileId}/manual-entities accepts operator-typed text +
type, performs server-side chunk-aware exact-string matching against the
file's TextExtractionService chunks, and atomically creates a catalogue
entry (or reuses an existing one for the same value+type pair) plus one
EntityRelation row per occurrence found. Idempotent on re-call. Defaults:
whole-word + case-sensitive matching. Zero-match returns 200 with the
catalogue entry created and a message. RBAC: file-write access required.
Matcher is overlap-aware: per-chunk preg_match_all with absolute-position
deduplication (chunk_overlap default 200 chars; needles > overlap are
rejected). Manual entities end up smarter than current Presidio detection
(which has the same overlap-double-count behaviour); flagged as future
work in proposal Out of scope ("presidio-overlap-dedup").
Pure openspec - proposal, design (10 decisions), capability spec delta
(6 ADDED Requirements), tasks (10 sections). No implementation in this
PR; awaiting team review before /opsx:apply.
Contributor
Quality Report — ConductionNL/openregister @
|
| Check | PHP | Vue | Security | License | Tests |
|---|---|---|---|---|---|
| lint | ✅ | ||||
| phpcs | ✅ | ||||
| phpmd | ✅ | ||||
| psalm | ✅ | ||||
| phpstan | ✅ | ||||
| phpmetrics | ✅ | ||||
| eslint | ✅ | ||||
| stylelint | ✅ | ||||
| composer | ✅ | ✅ 162/162 | |||
| npm | ✅ | ✅ 602/602 | |||
| PHPUnit | ✅ | ||||
| Newman | ⏭️ | ||||
| Playwright | ⏭️ |
Quality workflow — 2026-05-19 13:10 UTC
Download the full PDF report from the workflow artifacts.
32 tasks
…+ service + endpoint (#1593) Closes sections 1-4 of the manual-entity-anonymisation tasks.md. Adds the operator-facing `POST /api/files/{fileId}/manual-entities` endpoint and the persistence layer it sits on. The existing anonymise flow picks up the new relation rows unchanged. **Value objects + constants (section 1):** - `lib/Service/File/ChunkTextMatcher.php` — pure utility, per-chunk preg_match_all + absolute-position dedup per design §D2. Lowest chunkIndex wins on ties so re-runs select the same canonical chunk (idempotency precondition). Rejects needles longer than the chunk overlap with a typed exception; PII-redacted error messages. - `lib/Exception/ChunkMatcherException.php` — typed reasons (`value_too_long`, `regex_compile_failure`). Extends `\Exception` directly so generic catches in the controller don't absorb it. - `lib/Exception/ManualEntityException.php` — orchestration-layer exception with typed reasons (`file_not_extracted`, `unsupported_entity_type`, `regex_compile_failure`, `internal_error`). Same direct-`\Exception` rationale. - `lib/Db/DetectionMethod.php` — `PRESIDIO` / `OPENANONYMISER` / `PATTERN` / `MANUAL` constants. Final class, no instantiation. **Mapper extensions (section 2):** - `GdprEntityMapper::findOneByValueAndType` — `LIMIT 2` probe; logs a warning on dedup-invariant violation (more than one row for the same `(value, type)` pair) and returns the first; never throws. Constructor gains a nullable `LoggerInterface` so legacy direct callers still work. - `EntityRelationMapper::existsForFileAtPosition` — single-row probe on the full `(fileId, entityId, chunkId, positionStart, positionEnd)` tuple. Backs the idempotency contract on retries. - `EntityRelationMapper::insertBatch` — caller-managed-transaction batch insert. Validates required fields via setter type-hints, lets DB errors propagate so the caller can roll back. Verified: the `openregister_entity_relations` table has no `uuid` column (per the create-migration); row payload + response omit `uuid`. **Orchestrator (section 3):** - `lib/Service/File/ManualEntityService.php` — single point of truth for the file-scoped operator-add operation. Lookup-or-create catalogue entry, run the matcher, probe for existing relations, batch-insert new ones, write audit trails — all inside one `IDBConnection::beginTransaction()` / `commit()` / `rollBack()` window. Translates `ChunkMatcherException` to `ManualEntityException(REASON_REGEX_COMPILE_FAILURE)`. - `lib/Service/File/ManualEntityResult.php` — immutable DTO carrying `{entity, entityWasNew, relations[], matchCount, matchesSkipped}`. - `assertFileWriteAccess` — inlined user-folder lookup + `isUpdateable()` check (mirrors `EntityRelationsController::canWriteFile`). Throws `ManualEntityException` with a `forbidden:` message prefix so the controller maps it to 403 instead of the default 500 for `REASON_INTERNAL_ERROR`. - Two audit-trail actions per the spec: `entity_create` (only when a new catalogue row is inserted) and `entity_relations_batch_create` (every call). ADR-022 forensic-exception: the operator-supplied `value` is allowed in the audit changed payload, never in HTTP logs / error responses. **Controller + route (section 4):** - `FileTextController::addManualEntity(int $fileId): JSONResponse` — content-type guard (415 on non-JSON), input validation (400 on missing `value` / `type`), session check (401), service dispatch, exception translation, response formatting. - Response shape per the proposal: 201 with `entity` + `relations` + `matchCount` + `matchesSkipped` on ≥1 match, 200 with the same shape plus `message` on zero matches. - `mapManualEntityException` translates `forbidden:`-prefixed REASON_INTERNAL_ERROR → 403, REASON_FILE_NOT_EXTRACTED → 422, REASON_REGEX_COMPILE_FAILURE / REASON_UNSUPPORTED_ENTITY_TYPE → 400, otherwise 500. - PII-redacted request log: `valueLength` only, never `value`. Actor UID is logged (ADR-005 allows UID). - Route registered at `POST /api/files/{fileId}/manual-entities`. **Quality:** - phpcs clean (4 line-length warnings, no errors) on all 10 touched files. - phpstan clean. - psalm clean. - Tests / docs / CHANGELOG follow in next commits (tasks 5-9). Status tracked in `openspec/changes/manual-entity-anonymisation/tasks.md` (boxes 1.1, 1.2, 1.3, 2.1, 2.2, 2.3, 3.1, 3.2, 3.3, 3.4, 3.5, 4.1, 4.2, 4.3 checked) and `plan.json` (32 sub-tasks, this commit covers the 14 implementation items).
… controller (#1593) Covers tasks 7.1–7.3 from the manual-entity-anonymisation tasks.md. **New test files:** - `tests/Unit/Service/File/ChunkTextMatcherTest.php` (13 tests) Single-chunk match, multi-match, overlap-region dedup (lowest chunkIndex wins), distinct cross-chunk matches, deterministic re-runs, whole-word / case-sensitive defaults + flips, zero-match empty array, empty-chunk-list empty array, value-too-long throws with no PII in message, empty needle returns empty. - `tests/Unit/Service/File/ManualEntityServiceTest.php` (8 tests) Happy path (new entity), reuse path, idempotent retry (all skip), zero-match still records, file-not-extracted aborts before any transaction, read-only file → forbidden, missing-file-for-actor, audit-failure → rollBack + wrap as REASON_INTERNAL_ERROR. - `tests/Unit/Controller/FileTextControllerManualEntityTest.php` (11 tests) — own sibling file matching the `Coverage` / `Deep` / `Gap` naming pattern. 415, 401, 400 (missing value / type), 403 (forbidden:-prefixed), 422 (file_not_extracted), 400 (regex_compile_failure), 500 (unknown throwable), 201 (success), 200 (zero match), and a PII-redaction spy on the request log asserting the operator-supplied `value` never reaches the log record. **Pre-existing fixes folded in (per the project rule on encountered quality issues):** - `FileTextController` constructor gained two new required params (`ManualEntityService`, `IUserSession`); the three existing test classes — `FileTextControllerTest`, `FileTextControllerCoverageTest`, `FileTextControllerDeepTest` — were updated to construct the SUT with all 10 args. - Three tests in `FileTextControllerTest` (`testAnonymizeFileSuccess`, `testAnonymizeFileDeduplicatesEntities`, `testAnonymizeFileExceptionDuringAnonymization`) plus one in `FileTextControllerDeepTest` mocked the long-removed `findEntitiesForFile` method; renamed to the current `findEntitiesForAnonymization`. These tests were silently producing HTTP-400 "no entities" responses instead of exercising the success / exception branches they claim to cover. **Verification:** 90 tests / 276 assertions all green in the live `master-nextcloud-1` container (`docker exec ... vendor/bin/phpunit --filter 'ChunkTextMatcherTest|ManualEntityServiceTest|FileTextController'`). The local fallback bootstrap (no NC) runs the pure unit subset and shows the same set of passes; the container run is the authoritative gate. **Note on PHPUnit named-args:** the Nextcloud `Entity::__call` magic does NOT translate named arguments to its `$args` positional array, so entity setter calls in tests MUST use positional arguments (`$relation->setPositionStart(13)`, not `setPositionStart(value: 13)`). PHPCS Conduction-rule's named-parameter sniff doesn't fire on these because `setPositionStart` isn't a real declared method. tasks.md updated to mark 7.1–7.3 done with the relevant implementation notes.
…ueAndType + insertBatch + existsForFileAtPosition (#1593) Covers tasks 5.1, 5.2, 7.4, 7.5 from the manual-entity-anonymisation tasks.md. **New tests (12 cases, 34 assertions):** - `tests/Unit/Db/GdprEntityMapperTest.php` (4 cases) — existing matching row returned, null when no row, dedup-invariant violation logs a warning + returns the first row (with ADR-005 PII redaction asserted on the log payload), no-logger path silently safe. - `tests/Unit/Db/EntityRelationMapperTest.php` (8 cases) — adds to the pre-existing single construction-wiring test: insertBatch (iterates rows in order, empty rows are a no-op, propagates exceptions to the caller, defaults `createdAt`, keeps explicit `createdAt`), existsForFileAtPosition (true on matching row, false on empty fetch). **Production change (minimal, behaviour-preserving):** - `lib/Db/GdprEntityMapper::findOneByValueAndType` now calls `$this->findEntities(...)` instead of `parent::findEntities(...)`. Identical at runtime (the mapper does not override findEntities) but allows a test subclass to inject row sets without booting the QBMapper SQL chain. Same testability adjustment is unnecessary for `EntityRelationMapper` because the new methods use `$this->insert(...)` (already overridable) and a directly- mockable IQueryBuilder chain. **Approach notes (kept in tasks.md):** - Tasks 5.1 and 5.2 are marked done via code inspection — the relevant SELECTs (lib/Db/EntityRelationMapper.php:327, and the `findEntityIdsByValueForFile` companion) do not reference `detection_method` at all, so manual-method rows are picked up automatically. A unit assertion of "WHERE clause omits a column" would be brittle; the contract is better tested via the integration test (7.6, deferred). - Per the pre-existing EntityRelationMapperTest convention, "DB-heavy query behaviour is covered by integration tests" — these new tests target the row → entity field mapping + return-value contract, not SQL semantics. **Verification:** 102 tests / 310 assertions in the live `master-nextcloud-1` container across the change's full surface (`docker exec ... vendor/bin/phpunit --filter 'ChunkTextMatcherTest|ManualEntityServiceTest|FileTextController|GdprEntityMapperTest|EntityRelationMapperTest'`).
…, PHPCS cleanup on new tests (#1593) Covers tasks 5.1, 5.2, 8.1, 8.2, and the 9.4 named-parameter sweep from the manual-entity-anonymisation tasks.md. **Docs (8.1, 8.2):** - New `docs/Features/manual-entity-anonymisation.md` — sibling to `entity-relation-decision-metadata.md`. Covers endpoint contract, semantics (atomicity, idempotency, lookup-or-create, match flag defaults), audit-trail behaviour (`entity_create` + `entity_relations_batch_create`), anonymise-flow interaction (no `detection_method` filter), the value-keyed substitution caveat, PII redaction (ADR-005 + ADR-022 forensic exception), and the RBAC contract. - CHANGELOG entry under `## Unreleased → ### Added` with the `#1593` issue link. **PHPCS sweep on new tests (9.4):** - All 5 new test files (`ChunkTextMatcherTest`, `ManualEntityServiceTest`, `FileTextControllerManualEntityTest`, `GdprEntityMapperTest`, `EntityRelationMapperTest`) are now PHPCS-clean: phpcbf-driven blank-line normalisation, member-var docblocks added, `$this->exactly(N)` and `$this->createMock(C::class)` converted to named-args, two inline-comment capitalisations fixed, the `EntityRelation` instanceof guard added to the test-mapper's `insert` override to satisfy PHPStan's array-type refinement. - PHPStan clean on the full diff (new + modified production files + new + modified tests). - All 9 new production files (Service/File/*, Exception/*, Db/DetectionMethod, FileTextController, GdprEntityMapper, EntityRelationMapper) remain PHPCS-clean (no new violations introduced by this commit). - Out of scope: pre-existing PHPCS noise in legacy `FileTextControllerTest` / `FileTextControllerCoverageTest` / `FileTextControllerDeepTest` (~270 errors apiece, predates this change). Touched only the constructor wiring and the one stale `findEntitiesForFile` → `findEntitiesForAnonymization` rename per the previous commit; phpcbf-fixed what it could but did not chase the legacy backlog. - Out of scope: `appinfo/routes.php` (583 pre-existing errors, one 3-line addition by this change followed the existing file pattern). **Spec status (5.1, 5.2, 6.1):** - 5.1 + 5.2 marked done — verified by code inspection that `findEntitiesForAnonymization` and `findEntityIdsByValueForFile` do not filter on `detection_method`, so manual-method rows flow through unchanged. Integration-level confirmation is deferred to task 7.6. - 6.1 (update canonical `openspec/specs/entity-relation-grondslagen/spec.md`) is a no-op until the `entity-relation-grondslagen` change archives (no canonical capability spec exists yet — the spec lives in the in-flight delta at `openspec/changes/entity-relation-grondslagen/specs/entity-relation-grondslagen/spec.md`). Marked deferred in tasks.md. **Verification:** `openspec validate manual-entity-anonymisation` clean. PHPStan clean on all touched files. PHPUnit: 102 tests / 310 assertions all green in the `master-nextcloud-1` container.
) Marks 6.1, 9.1, 9.2, 9.4, 10.1, 10.3 with their final status + implementation notes. Remaining open items: - 7.6 (integration test against stacked OR) — deferred, needs running stack + chunk-fixture wiring. - 9.3 (manual smoke test) — operator-driven, out of scope. - 10.2 (DocuDesk tracking issue) — pending user confirmation before filing. 29 of 32 sub-tasks are now done; the remaining three are explicitly scoped out of this branch (integration test + operator smoke + cross-app issue filing).
#1593) Closes task 10.2. Filed [ConductionNL/docudesk#225](ConductionNL/docudesk#225) — `[OpenSpec] [docudesk] manual-entity-anonymisation-ui (PoC)`. PoC framing is explicit in the title + summary + task list ("stock NC components, no custom styling, no new deps") so the frontend team uses the work as a reference, not as a production implementation. 30 of 32 sub-tasks done. Remaining 2 are out of scope for this branch: 7.6 (integration test, needs a stacked OR run) and 9.3 (operator-driven manual smoke).
…-side; surface root cause on rollback (#1593) **Root cause of the operator-reported 500:** The `oc_openregister_entities` table has TWO `NOT NULL`-with-no-default columns (`category` and `updated_at`) that the existing detector flow (`EntityRecognitionHandler::findOrCreateEntity`) always populates. The new manual-entity flow only populated `detected_at`, so the insert failed at the DB layer with two consecutive `SQLSTATE[HY000] 1364 Field 'X' doesn't have a default value` errors — one per column. **Fix #1 — `category` is now server-derived, removed from the API:** - `EntityRecognitionHandler::getCategoryForType()` is now `public static` (was `private`) so producers outside the detector pipeline can derive the same mapping without taking on the handler's DI graph. - `ManualEntityService::lookupOrCreateEntity()` calls it on every new insert. Rationale: the operator UI rarely has the context to make a category decision, the detector flow already derives consistently, and exposing the field as operator-input was the original source of the bug (omitted = null = NOT NULL violation). Manual-entity rows and detector rows now share categories for identical types. - `category` removed from the request body, the service signature, the `writeAuditTrails` signature, the controller body-parsing, and the unit tests. The audit trail still records `category` in `changed.fields` — read from the entity after insert, so the persisted value is captured verbatim. - Operator-override on category is intentionally NOT exposed in v1; a follow-up if a concrete use case emerges. **Fix #2 — `updated_at` populated alongside `detected_at`:** - Same root cause class. Mirroring the detector flow's column population fixes it. **Fix #3 — operator-debuggable rollback path:** The service's `catch (Throwable $error)` block wrapped the original exception as `ManualEntityException(REASON_INTERNAL_ERROR)` without logging. Combined with the controller's `mapManualEntityException` only logging the reason code (not the previous-exception message), a 500 left no debuggable trail in `nextcloud.log` — exactly what the operator hit. Added a `LoggerInterface::error()` call inside the service's rollback handler that captures `errorType`, `error`, and `trace` of the underlying Throwable before re-wrapping. Future operator-reported 500s will have a stack-trace-level log entry keyed by reqId. **Spec + docs updates:** - `docs/Features/manual-entity-anonymisation.md` — request body shape loses the `category` line; added a note explaining the server-side derivation. - `openspec/changes/manual-entity-anonymisation/proposal.md` — example body updated; the 400 error hint updated. - `openspec/changes/manual-entity-anonymisation/design.md` — derivation rationale folded into the existing "lookup-or-create" section; the open-question on category resolved with a "not in the API at all" decision. - `openspec/changes/manual-entity-anonymisation/specs/entity-relation-grondslagen/spec.md` — Requirement updated, "Category mismatch does NOT trigger an update" scenario replaced with "New catalogue row carries the server-derived category". - `openspec validate manual-entity-anonymisation` clean. **Verification:** - PHPCS clean (4 touched files including the new logger call). - PHPStan clean on the full diff. - 44 unit tests / 145 assertions across `ManualEntityServiceTest|FileTextControllerManualEntityTest|ChunkTextMatcherTest|GdprEntityMapperTest|EntityRelationMapperTest` all green in the live `master-nextcloud-1` container. - End-to-end smoke test in the dev environment: - Empty-match POST → HTTP 200 with `message`, catalogue row created. - Match POST → HTTP 201 with `matchCount=4, matchesSkipped=2` (the 2 skipped were pre-existing detector hits at the same positions). - Idempotent re-POST → HTTP 201 with `matchCount=4, matchesSkipped=4`, `relations: []` (no duplicates). Refs: #1593
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Full implementation of the
manual-entity-anonymisationchange. Extends theentity-relation-grondslagencapability with a write path for operator-supplied entities:POST /api/files/{fileId}/manual-entities: operator types text + entity type, server matches against the file'sTextExtractionServicechunks, creates aGdprEntitycatalogue entry (or reuses an existing one for the same(value, type)pair), and inserts oneEntityRelationper occurrence — all atomically.EntityRelationMapper::existsForFileAtPositionbefore insert; already-present rows bumpmatchesSkipped).wholeWord = true,caseSensitive = true(operator means exactly what they typed).message; operator intent is recorded in the audit trail regardless.isUpdateable()) — same model asPATCH /api/entity-relations/{id}.valueLengthonly, nevervalue. Error responses never echo the operator-supplied text.entity_create(only on new catalogue rows) +entity_relations_batch_create(every call, including zero-match).What's in this PR
openspec/changes/manual-entity-anonymisation/{proposal,design,tasks}.md+specs/entity-relation-grondslagen/spec.md.lib/Service/File/ChunkTextMatcher.php— chunk-awarepreg_match_all+ absolute-position dedup (overlap-region: lowestchunkIndexwins). Rejects needles > chunk overlap; PII-safe error messages.lib/Service/File/ManualEntityService.php— orchestrates the lookup-or-create + batch insert + audit-trail write inside oneIDBConnection::beginTransaction()window. Catches and logs the root cause on rollback before re-wrapping asManualEntityException.lib/Service/File/ManualEntityResult.php— DTO{entity, entityWasNew, relations, matchCount, matchesSkipped}.lib/Exception/{ChunkMatcherException,ManualEntityException}.php— typed reason codes (value_too_long,regex_compile_failure,file_not_extracted,unsupported_entity_type,internal_error).lib/Db/DetectionMethod.php— constants (PRESIDIO/OPENANONYMISER/PATTERN/MANUAL).lib/Db/GdprEntityMapper::findOneByValueAndType—(value, type)lookup with dedup-invariant warning + first-row fallback.lib/Db/EntityRelationMapper::existsForFileAtPosition+::insertBatch— idempotency probe + caller-managed-transaction batch insert.lib/Controller/FileTextController::addManualEntity— content-type guard, body validation, exception → status-code mapping, PII-redacted log.appinfo/routes.php—POST /api/files/{fileId}/manual-entities.categoryderivation: the endpoint does NOT accept acategoryfield; the column isNOT NULLand is populated fromtypeviaEntityRecognitionHandler::getCategoryForType()(lifted topublic staticfor shared use), matching the detector flow. This was a deliberate UX simplification + bug-class elimination (the operator UI rarely has the context to set category, and forgetting to set it was the original source of HTTP 500s during testing).docs/Features/manual-entity-anonymisation.md+ CHANGELOG entry under### Added.Cross-change dependency
This branch is based on
developmentdirectly. It does NOT include theentity-relation-grondslagenwork (PR #1503) — that's tracked separately. The two changes share theEntityRelationtable:skip_anonymizationandbasescolumns are added by Implement entity-relation-grondslagen — decision-metadata PATCH endpoint (#1435) #1503.EntityRelationMapper::insertBatchaccepts both keys in the row payload but silently ignores them on this branch (forward-compat reserve, documented in the row-payload docstring). Once Implement entity-relation-grondslagen — decision-metadata PATCH endpoint (#1435) #1503 lands,insertBatchshould gain the matchingsetBases/setSkipAnonymizationcalls so manual-entity callers can pre-populate operator decisions.findEntitiesForAnonymization/findSkippedEntityValuesForFile/updateDecisionMetadatabelong to Implement entity-relation-grondslagen — decision-metadata PATCH endpoint (#1435) #1503 and stay there. Manual-method relations flow through whatever anonymise pass exists ondevelopmentat merge time (currentlyfindEntitiesForFile), and pick up Implement entity-relation-grondslagen — decision-metadata PATCH endpoint (#1435) #1503's skip-aware filter automatically once that lands.feat/225/manual-entity-anonymisation-ui-pocbranch (separately, offtest/anonimiseren-bij-de-bron-dd).Verification
openspec validate manual-entity-anonymisation— clean.FileTextControllerTest.php+appinfo/routes.phpPHPCS noise is pre-existing ondevelopmentand untouched here).vendor/bin/phpunit --filter 'ChunkTextMatcherTest|ManualEntityServiceTest|FileTextController|GdprEntityMapperTest|EntityRelationMapperTest').valuenot in file) → HTTP 200 withmessage; catalogue row created.value: "Robert"on a meeting-notes file) → HTTP 201,matchCount: 4, 4 relations created.matchCount: 4,matchesSkipped: 4,relations: [](no duplicates).Test plan
openspec validate manual-entity-anonymisationvendor/bin/phpunit— 102 tests passdesign.md§D2)categoryderivation decision (server-side fromtype, not operator-supplied — seedesign.mdopen-questions resolution)EntityRelationcolumns + the 3 grondslagen mapper methods)