35 changes: 35 additions & 0 deletions .github/workflows/api-test-coverage.yml
@@ -98,6 +98,41 @@ jobs:
          docker exec nextcloud su -s /bin/bash www-data -c "php /var/www/html/occ app:enable openregister"
          docker exec nextcloud su -s /bin/bash www-data -c "php /var/www/html/occ app:list" | grep openregister

      - name: Wait for Apache + OpenRegister API to serve HTTP
        # `occ status` only proves the DB-side install completed; Apache may
        # still not be answering on :8080. The Newman suite runs from the
        # host and previously hit `[errored]` on every request because the
        # HTTP listener wasn't ready yet. Poll both the NC entry page (proves
        # Apache is up + port-forwarded) and the OpenRegister settings
        # endpoint (proves the app is enabled + dispatching).
        run: |
          # 1. Apache reachable on the host-mapped port.
          for i in $(seq 1 60); do
            if curl -fsS -o /dev/null -w '' http://localhost:8080/status.php; then
              echo "✓ Apache responding on localhost:8080 (try $i)"
              break
            fi
            echo "Waiting for Apache... ($i/60)"
            sleep 2
          done

          # 2. OpenRegister API dispatching (admin GET against a stable endpoint).
          for i in $(seq 1 60); do
            code=$(curl -s -o /dev/null -w '%{http_code}' \
              -u admin:admin \
              http://localhost:8080/index.php/apps/openregister/api/settings/rbac || echo "000")
            if [ "$code" = "200" ] || [ "$code" = "401" ] || [ "$code" = "405" ]; then
              echo "✓ OpenRegister API responding (HTTP $code, try $i)"
              break
            fi
            echo "Waiting for OpenRegister API... (try $i, last HTTP=$code)"
            sleep 2
          done

          # 3. Final sanity probe: fail fast if still down.
          curl -fsS -u admin:admin http://localhost:8080/index.php/apps/openregister/api/settings/rbac -o /dev/null \
            || (echo "::error::OpenRegister API still not responding after wait" && exit 1)

      - name: Run Newman orchestrator
        env:
          BASE_URL: http://localhost:8080
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -4,9 +4,11 @@

### Added
- **EML (`message/rfc822`) support in `TextExtractionService`.** Two output paths share an underlying `zbateson/mail-mime-parser` invocation: (1) a flat plain-text path used by `extractFile` for entity detection — header block (`From` / `To` / `Cc` / `Subject` / `Date`), blank line, body (`text/plain` preferred over `text/html`-stripped-to-text), attachments listed under `--- Attachment: <filename> ---` markers; nested EML attachments are inlined via recursive flattening; (2) a new public `TextExtractionService::parseEmlStructured(File): EmlStructure` that returns headers + `EmlBody` (`plainText`, `html`) + array of `EmlAttachment` (decoded binary bytes — not the base64 transport string — plus filename / MIME / inline / contentId / `nestedEml`). Recursion is capped at depth 3 (root = depth 0; deeper `message/rfc822` attachments expose an `EmlAttachment` shell with `nestedEml = null`). `parseEmlStructured` MUST throw `EmlParseException` on irrecoverable malformed input — consumers (notably DocuDesk's `eml-pdf-assembly`) drive their fallback paths via exception propagation. Non-UTF-8 body parts are transcoded via `mb_detect_encoding` + `mb_convert_encoding`. Filename resolution: `Content-Disposition` `filename` → `Content-Type` `name` → generated `attachment-<n>` (1-indexed). Per ADR-005, parser-failure log lines are PII-sanitised — addresses, quoted strings, and angle-bracketed values are replaced with `<redacted>` before logging. New dependency: `zbateson/mail-mime-parser:^3.0`. (`text-extraction-eml`)
- **`PATCH /api/entity-relations/{id}`: operator decision-metadata endpoint.** The `EntityRelation` row now carries two operator-decision fields: `bases` (nullable JSON array of UUIDs referencing legal grondslagen) and `skip_anonymization` (boolean, default false). The new endpoint accepts a JSON body with a strict whitelist of `{bases?, skipAnonymization?}`; any other key returns HTTP 400. A parallel DI method `EntityRelationMapper::updateDecisionMetadata(EntityRelation $relation, array $fields, ?IUser $actingUser = null): EntityRelation` provides the same contract for in-process callers (e.g. DocuDesk's anonymise flow). The row UPDATE and audit-trail INSERT run inside a single DB transaction, so an audit-INSERT failure rolls back the UPDATE and surfaces as HTTP 500: clients never observe a persisted decision-metadata change without a matching audit entry. The `bases` diff uses set equality (order-insensitive, duplicates collapsed), so cosmetic reorderings do not produce spurious audit entries; `null` vs `[]` remains distinct. Successful writes that produce a non-empty diff emit one immutable audit-trail entry (ADR-022) capturing the acting user UID (per ADR-005, the UID and not the display name), timestamp, row identifier, and per-field previous → new values; semantic no-ops (PATCH with values identical to current state, or empty body) succeed with HTTP 200 and write no audit entry. The endpoint requires write-access to the relation's parent file/object (the same implicit check that `markAsAnonymized` inherits today) and is `@NoAdminRequired`. The post-hoc system fields `anonymized` and `anonymizedValue` are intentionally NOT in the whitelist; those record what the redaction code path actually did and remain writable only by `markAsAnonymized`. (`entity-relation-grondslagen`)

### Behaviour changes
- **EML files that previously produced null / empty extracted-text now produce populated text.** Files with `mimetype: message/rfc822` are now extracted via the new EML pipeline rather than being silently skipped. The flat output is suitable for the existing entity-detection pipeline. Tenants that relied on EMLs being skipped (unlikely) need to revisit their downstream flows. Non-EML extractable attachments (PDF, DOCX, text within an EML) are listed by filename + MIME type only in v1 — inline text extraction for those types is deferred (the consumer-side `eml-pdf-assembly` handles rich rendering separately). (`text-extraction-eml`)
- **Anonymise flow now honours `skip_anonymization`.** Rows where `skip_anonymization = true` are excluded from the redaction pass: `EntityRelationMapper::markAsAnonymized` no longer flips `anonymized = true` on skipped rows (added `AND skip_anonymization = 0` predicate); `FileTextController::anonymizeFile` reads relations through the new skip-aware `EntityRelationMapper::findEntitiesForAnonymization` method; `DocumentProcessingHandler::anonymizeDocument` defensively filters out skipped occurrences server-side before text-replacement so the OR contract ("skipped relations are never redacted, full stop") holds regardless of caller behaviour. Skipped rows retain `anonymized = false` after the file's anonymise pass — the operator decision is preserved and queryable via the flag. Files with no skipped rows see behaviour identical to pre-change. The `skip_anonymization` flag is forward-looking: flipping it to `true` on an already-anonymised row does not retroactively un-redact the file. (`entity-relation-grondslagen`)

### Breaking Changes
- **`@self.files` on rendered objects is now opt-in for full file metadata.** By default, `@self.files` is a lightweight list of integer file IDs (`[123, 456, 789]`). Consumers that need full file metadata (`id`, `path`, `title`, `accessUrl`, `downloadUrl`, `type`, `extension`, `size`, `hash`, `published`, `modified`, `labels`) MUST add `_extend[]=@self.files` (or the equivalent shorthand `_extend[]=_files`) to their request. The change applies to **every** consumer of OpenRegister's render output, including `show` endpoints in dependent apps (e.g. opencatalogi `/publications/{catalogSlug}/{id}`). Migration is a one-line query parameter addition. The previous behavior — full metadata always served on show, no metadata on list — caused asymmetric responses across endpoints and paid the file-lookup cost on every show response regardless of need. The new contract is symmetric across show and list endpoints (both emit `@self.files` as IDs by default; both accept `_extend[]=@self.files` for full metadata) and is documented under the `files-render-extension` capability. **Note:** Using `_extend[]=@self.files` (or `_files`) on **list** endpoints is heavily discouraged because it triggers per-row file/tag lookups (N+1 queries scaling with page size) and will result in degraded performance. Use it only when full file metadata is genuinely required for every row. **SOLR limitation:** on SOLR/index-backed list endpoints, `_extend[]=@self.files` is not yet supported; the lightweight ID list is always returned and the response carries `@self.extend_unsupported: ["@self.files"]` so consumers can detect the mismatch programmatically. Use the database-backed path when full file metadata is required on lists.
3 changes: 3 additions & 0 deletions appinfo/routes.php
@@ -610,6 +610,9 @@
// File Anonymization - Replace detected entities with placeholders.
['name' => 'fileText#anonymizeFile', 'url' => '/api/files/{fileId}/anonymize', 'verb' => 'POST', 'requirements' => ['fileId' => '\\d+']],

// Entity Relations - Decision-metadata PATCH (bases + skipAnonymization). See `entity-relation-grondslagen`.
['name' => 'entityRelations#update', 'url' => '/api/entity-relations/{id}', 'verb' => 'PATCH', 'requirements' => ['id' => '\\d+']],

// GDPR Entities - Manage detected PII entities.
['name' => 'gdprEntities#index', 'url' => '/api/entities', 'verb' => 'GET'],
['name' => 'gdprEntities#show', 'url' => '/api/entities/{id}', 'verb' => 'GET', 'requirements' => ['id' => '\\d+']],
88 changes: 88 additions & 0 deletions docs/Features/entity-relation-decision-metadata.md
@@ -0,0 +1,88 @@
# Entity-Relation Decision Metadata

OpenRegister exposes an audited `PATCH /api/entity-relations/{id}` endpoint (and a parallel DI mapper method `EntityRelationMapper::updateDecisionMetadata`) for setting operator decisions on detected-entity occurrences. The two decision fields are:

- **`bases`** — `?array<string>` — UUIDs referencing legal grondslagen (Woo Art. 5 / equivalent) that justify redacting the occurrence. OpenRegister persists the UUIDs verbatim and does not validate that they resolve — the vocabulary is owned by the consumer app (DocuDesk's `dossier` register is the first consumer).
- **`skipAnonymization`** — `bool` (default `false`) — when `true`, the anonymise pass excludes this occurrence: text-replacement skips it, and `EntityRelationMapper::markAsAnonymized` leaves `anonymized = false` on the row.

These are **decision-only** fields. The post-hoc system fields `anonymized` and `anonymizedValue` (which record what the redaction code path actually did) are intentionally NOT in the PATCH whitelist; only `EntityRelationMapper::markAsAnonymized` writes them.

## Endpoint contract

```
PATCH /api/entity-relations/{id}
Content-Type: application/json
Body: { "bases"?: null | string[], "skipAnonymization"?: boolean }
```

- **200**: success. The body is the updated `EntityRelation` (`jsonSerialize` shape).
- **400**: shape or whitelist violation. Body: `{"error": "<message>", "details": {"field": "<name>", "reason"?: "<code>"}}`. Triggered by:
  - Any non-whitelisted top-level key (e.g. `anonymized`, `entityId`).
  - `bases` that is neither `null` nor `array<string>`.
  - `skipAnonymization` that is not a boolean.
- **401**: no authenticated session.
- **403**: acting user lacks write-access to the relation's parent file (or object/email). For file-bound relations the check resolves the file through the user-folder and requires `isUpdateable()` to be `true`. Object- and email-bound relations are accepted with a warning log in v1; tightening this is tracked as a follow-up.
- **404**: `{id}` does not resolve to an existing relation.
- **500**: unexpected failure during the write.

The endpoint is `@NoAdminRequired` — non-admins can PATCH relations they have write-access to.
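
The 400 rules above can be sketched in a few lines. This is a language-neutral illustration in Python, not OpenRegister's PHP controller code: `validate_patch_body`, `ValidationError`, and the reason codes `not_whitelisted` / `invalid_type` are hypothetical names chosen for the example.

```python
from typing import Any

# The only keys the contract accepts at the top level of the body.
ALLOWED_KEYS = {"bases", "skipAnonymization"}


class ValidationError(Exception):
    """Hypothetical error type that a controller would map to HTTP 400."""

    def __init__(self, field: str, reason: str) -> None:
        super().__init__(f"{field}: {reason}")
        self.field = field
        self.reason = reason  # illustrative reason code, not from the spec


def validate_patch_body(body: dict[str, Any]) -> dict[str, Any]:
    """Return the body unchanged if it satisfies the contract, else raise."""
    for key in body:
        if key not in ALLOWED_KEYS:
            raise ValidationError(key, "not_whitelisted")

    if "bases" in body:
        bases = body["bases"]
        is_string_list = isinstance(bases, list) and all(
            isinstance(item, str) for item in bases
        )
        if bases is not None and not is_string_list:
            raise ValidationError("bases", "invalid_type")

    if "skipAnonymization" in body and not isinstance(body["skipAnonymization"], bool):
        raise ValidationError("skipAnonymization", "invalid_type")

    return body
```

Note that an empty body passes validation; whether it changes anything is a separate question answered by the diff step described under Semantics.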

## Semantics

- **Single audited write path.** Both the HTTP controller and in-process DI callers go through `EntityRelationMapper::updateDecisionMetadata`. There is no parallel write path that bypasses validation or the audit trail.
- **Diff-aware.** Only fields whose new value differs from the current row state contribute to the update and the audit entry. A PATCH where every supplied value matches the current state, or an empty body `{}`, returns 200 with the unchanged row and writes NO audit entry.
- **Three-way `bases` semantics.** Field absent → unchanged; `"bases": null` → cleared; `"bases": [...]` → set to the given array. An empty array `"bases": []` is stored as an empty array and stays distinct from `null` per the spec.
- **Audit-trail entry** (per successful change):
  ```
  action              = "entity_relation_decision_updated"
  user                = acting user UID (ADR-005: NEVER the display name in the structured changed-fields payload)
  created             = now (UTC)
  changed.subjectType = "openregister_entity_relations"
  changed.subjectId   = <relation id>
  changed.fields      = { "<field>": { "previous": <old>, "new": <new> } }   (only fields that actually changed)
  ```
  Reads of `EntityRelation` rows produce no audit entries.
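
The diff-aware behaviour can be illustrated with a small sketch. It is written in Python for brevity and is not the mapper's actual PHP code; the function names are hypothetical. The `bases` comparison follows the changelog's description (order-insensitive, duplicates collapsed, `null` distinct from `[]`), and the audit payload mirrors the shape listed above.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def bases_equal(a: Optional[list[str]], b: Optional[list[str]]) -> bool:
    """Compare two `bases` values; None and [] are deliberately distinct."""
    if a is None or b is None:
        return a is None and b is None
    # Order-insensitive, duplicates collapsed.
    return set(a) == set(b)


def compute_changed_fields(current: dict[str, Any], patch: dict[str, Any]) -> dict[str, Any]:
    """Return only the supplied fields whose value differs from the row."""
    changed: dict[str, Any] = {}
    if "bases" in patch and not bases_equal(current["bases"], patch["bases"]):
        changed["bases"] = {"previous": current["bases"], "new": patch["bases"]}
    if "skipAnonymization" in patch and patch["skipAnonymization"] != current["skipAnonymization"]:
        changed["skipAnonymization"] = {
            "previous": current["skipAnonymization"],
            "new": patch["skipAnonymization"],
        }
    return changed


def build_audit_entry(relation_id: int, uid: str, changed: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Build the audit payload, or None for a semantic no-op."""
    if not changed:
        return None
    return {
        "action": "entity_relation_decision_updated",
        "user": uid,  # the UID, never the display name
        "created": datetime.now(timezone.utc).isoformat(),
        "changed": {
            "subjectType": "openregister_entity_relations",
            "subjectId": relation_id,
            "fields": changed,
        },
    }
```

With a row whose `bases` is `["a", "b"]`, a PATCH of `["b", "a"]` yields an empty diff and therefore no audit entry, while a PATCH of `null` yields one changed field.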

## How callers use it

**HTTP** (DocuDesk frontend, batch tools, scripts):

```http
PATCH /api/entity-relations/123
{ "bases": ["b8a3-..."], "skipAnonymization": false }
```
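
For script and batch-tool callers, the request above could be built as follows. This is a minimal Python sketch under stated assumptions: the `/index.php/apps/openregister` prefix is taken from the CI workflow's probe URL, HTTP Basic auth is assumed to be accepted by the target instance, and `build_decision_patch` is a hypothetical helper name.

```python
import base64
import json
import urllib.request


def build_decision_patch(
    base_url: str, relation_id: int, fields: dict, user: str, password: str
) -> urllib.request.Request:
    """Build (but do not send) a PATCH request for one entity relation."""
    # Assumed URL layout: Nextcloud app prefix + the route from appinfo/routes.php.
    url = f"{base_url}/index.php/apps/openregister/api/entity-relations/{relation_id}"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=json.dumps(fields).encode("utf-8"),
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",  # assumption: Basic auth is enabled
        },
    )


# Usage (not executed here): send the request and decode the updated relation.
# request = build_decision_patch("http://localhost:8080", 123,
#                                {"skipAnonymization": True}, "admin", "admin")
# with urllib.request.urlopen(request) as response:
#     relation = json.load(response)
```

Only the two whitelisted keys should appear in `fields`; anything else is rejected with HTTP 400 per the endpoint contract.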

**PHP DI** (DocuDesk's `AnonymizationService`, OpenConnector pipelines, anywhere in OR's process):

```php
$mapper = $this->getOpenRegisterService('OCA\OpenRegister\Db\EntityRelationMapper');
// $relation is the already-loaded EntityRelation entity for row 123.
$mapper->updateDecisionMetadata(
    relation: $relation,
    fields: ['bases' => ['b8a3-...'], 'skipAnonymization' => false],
    actingUser: $this->userSession->getUser()
);

DocuDesk specifically uses the DI path for its prohibition-override flow: when an operator acknowledges an override, DocuDesk writes its own audit entry (capturing the operator's reason) and then PATCHes the relation with `skipAnonymization=true` via this DI method — so OR's anonymise pass automatically excludes the released entity. See [DocuDesk `anonymisation-grondslagen-and-prohibition-gate`](https://github.com/ConductionNL/docudesk/pull/135).

## Anonymise-flow interaction

The new `skipAnonymization` field changes the behaviour of two existing code paths:

1. **`POST /api/files/{fileId}/anonymize` (HTTP)**: `FileTextController::anonymizeFile` reads relations through `EntityRelationMapper::findEntitiesForAnonymization`, which adds `AND skip_anonymization = 0` to the existing `findEntitiesForFile` query. Skipped relations are not in the replacements list and are not flipped by `markAsAnonymized`.
2. **`FileService::anonymizeDocument(Node, entities[])` (DI)**: the underlying `DocumentProcessingHandler::anonymizeDocument` defensively filters the caller-supplied `entities[]` against `EntityRelationMapper::findSkippedEntityValuesForFile($fileId)`. Even if the caller includes skipped occurrences in the array, OR drops them server-side before text-replacement. Contract: "skipped relations are never redacted, full stop."

After the anonymise call:

- Non-skipped relations: `anonymized = true`, `anonymizedValue = <placeholder>` (existing behaviour).
- Skipped relations: `anonymized = false`, the operator's `skipAnonymization = true` flag is preserved.

`skipAnonymization` is **forward-looking**: flipping it to `true` on an already-anonymised row does not retroactively un-redact the file. The redaction has already happened in the file content; only future re-runs honour the flag.
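
The post-call state described above can be modelled with a short sketch. It is a simplified Python illustration, not the PHP handler: the `Relation` dataclass, the `anonymize` function, and the `[REDACTED]` placeholder are hypothetical stand-ins for the real entities and placeholder format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Relation:
    """Simplified stand-in for an entity-relation row."""

    id: int
    value: str
    skip_anonymization: bool = False
    anonymized: bool = False
    anonymized_value: Optional[str] = None


def anonymize(text: str, relations: list[Relation], placeholder: str = "[REDACTED]") -> str:
    """Replace every non-skipped occurrence and record what was done."""
    for relation in relations:
        if relation.skip_anonymization:
            # Skipped relations are never redacted and keep anonymized = False.
            continue
        text = text.replace(relation.value, placeholder)
        relation.anonymized = True
        relation.anonymized_value = placeholder
    return text


relations = [Relation(1, "Alice"), Relation(2, "Bob", skip_anonymization=True)]
result = anonymize("Alice met Bob", relations)
```

After the call, `result` still contains "Bob", the first relation is marked anonymised, and the second keeps both `anonymized = False` and its skip flag. The sketch ignores overlapping values (one entity value being a substring of another), which a real implementation has to handle.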

## Spec references

- Capability: [`openspec/changes/entity-relation-grondslagen/specs/entity-relation-grondslagen/spec.md`](../../openspec/changes/entity-relation-grondslagen/specs/entity-relation-grondslagen/spec.md)
- Design (anonymise flow, audit, authz, two-column migration): [`openspec/changes/entity-relation-grondslagen/design.md`](../../openspec/changes/entity-relation-grondslagen/design.md)
- ADR-022 (audit-trail for OR-owned mutations).
- ADR-005 (no PII in logs; UID not display name in audit payloads).
- ADR-023 (action-level authorization — opt-in; not introduced here).
19 changes: 19 additions & 0 deletions lib/AppInfo/Application.php
@@ -451,6 +451,25 @@ function (ContainerInterface $container) {
            }
        );

        // EntityRelationMapper is registered explicitly because it constructor-injects
        // `IEventDispatcher` to dispatch `EntityRelationDecisionUpdatedEvent`. Every
        // other event-dispatcher-dependent mapper in this method (SchemaMapper,
        // RegisterMapper, MagicMapper, WebhookMapper) is wired explicitly to avoid
        // depending on NC's autowirer resolution of framework-interface keys, which
        // has historically been version-sensitive.
        $context->registerService(
            EntityRelationMapper::class,
            function (ContainerInterface $container) {
                return new EntityRelationMapper(
                    db: $container->get('OCP\IDBConnection'),
                    auditTrailMapper: $container->get(\OCA\OpenRegister\Db\AuditTrailMapper::class),
                    userSession: $container->get('OCP\IUserSession'),
                    eventDispatcher: $container->get('OCP\EventDispatcher\IEventDispatcher'),
                    logger: $container->get('Psr\Log\LoggerInterface')
                );
            }
        );

    }//end registerMappersWithCircularDependencies()

/**