feat: persist duplicate detection results in PostgreSQL #85

@JohnRDOrazio

Problem

Duplicate detection results are currently cached only in Redis with a 10-minute TTL. When the page is reloaded, results are gone and the user must click "Find Duplicates" again (~56 seconds for a 15K-class ontology).

Proposal

  1. New duplicate_detection_results table — stores the latest detection results per project/branch, including clusters, threshold, and timestamp
  2. Auto-update on index rebuild — when run_ontology_index_task completes (or the index is updated via entity edits), automatically re-run duplicate detection and persist the results
  3. Frontend loads persisted results on mount — the Duplicates tab should check for stored results first, showing them immediately without requiring a manual "Find Duplicates" click
  4. "Find Duplicates" button re-runs detection — still available for on-demand refresh, updating the persisted results
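The storage side of this proposal can be sketched as one row per project/branch that each new run overwrites. This is a minimal illustration, not the project's schema: the column names and the `save_results` / `load_results` helpers are assumptions, and the standard-library `sqlite3` module stands in for PostgreSQL so the example runs on its own. PostgreSQL accepts the same `INSERT ... ON CONFLICT ... DO UPDATE` form, where `clusters` would more naturally be a JSONB column.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: column names are assumptions, not taken from the issue.
SCHEMA = """
CREATE TABLE IF NOT EXISTS duplicate_detection_results (
    project_id  TEXT NOT NULL,
    branch      TEXT NOT NULL,
    threshold   REAL NOT NULL,
    clusters    TEXT NOT NULL,   -- JSON text here; JSONB would suit PostgreSQL
    detected_at TEXT NOT NULL,   -- ISO 8601 UTC timestamp
    PRIMARY KEY (project_id, branch)
)
"""

# The primary key guarantees a single "latest" row per project/branch.
UPSERT = """
INSERT INTO duplicate_detection_results
    (project_id, branch, threshold, clusters, detected_at)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (project_id, branch) DO UPDATE SET
    threshold   = excluded.threshold,
    clusters    = excluded.clusters,
    detected_at = excluded.detected_at
"""


def save_results(conn, project_id, branch, threshold, clusters):
    """Store the latest run, replacing any earlier row for this project/branch."""
    detected_at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        UPSERT,
        (project_id, branch, threshold, json.dumps(clusters), detected_at),
    )
    conn.commit()


def load_results(conn, project_id, branch):
    """Return the stored run as a dict, or None if nothing has been persisted."""
    row = conn.execute(
        "SELECT threshold, clusters, detected_at "
        "FROM duplicate_detection_results "
        "WHERE project_id = ? AND branch = ?",
        (project_id, branch),
    ).fetchone()
    if row is None:
        return None
    threshold, clusters, detected_at = row
    return {
        "threshold": threshold,
        "clusters": json.loads(clusters),
        "detected_at": detected_at,
    }


# In-memory database, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
```

With this shape, the Duplicates tab could call something like `load_results` on mount and fall back to the manual "Find Duplicates" flow only when it returns `None`. The button itself would go through the same `save_results` path, so on-demand and automatic runs keep a single stored copy.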

Context

PR #80 moved duplicate detection to the ARQ worker queue and rewrote it to use PostgreSQL's pg_trgm GIN index instead of in-memory rdflib parsing. The detection itself now completes in ~56 seconds for large ontologies. Persisting results would spare users that wait on every page reload.

Tasks

  • Create duplicate_detection_results table and Alembic migration
  • Store results after run_duplicate_detection_task completes
  • Load persisted results in the Duplicates tab on mount (frontend)
  • Trigger duplicate detection automatically after run_ontology_index_task
  • Update "Find Duplicates" to refresh persisted results
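The automatic trigger could be as small as enqueueing the detection job at the end of the index task. The sketch below assumes both tasks take `(ctx, project_id, branch)`, which the issue does not specify, and uses a hypothetical `FakeQueue` in place of the ARQ Redis pool so it runs without Redis. In an ARQ worker, `ctx["redis"]` provides an `enqueue_job` method that is called the same way.

```python
import asyncio


class FakeQueue:
    """Stand-in for the job queue: records enqueue_job calls instead of using Redis."""

    def __init__(self):
        self.jobs = []

    async def enqueue_job(self, function_name, *args):
        self.jobs.append((function_name, args))


async def run_ontology_index_task(ctx, project_id, branch):
    # Assumed signature. The actual index rebuild is omitted here.
    # Once the index is fresh, queue detection so the persisted
    # duplicate results are regenerated from it.
    await ctx["redis"].enqueue_job(
        "run_duplicate_detection_task", project_id, branch
    )


queue = FakeQueue()
asyncio.run(run_ontology_index_task({"redis": queue}, "p1", "main"))
```

Because entity edits can update the index many times in quick succession, some form of deduplication is probably worth considering so that a ~56-second detection run is not queued for every edit. ARQ's `_job_id` argument to `enqueue_job` may help here, though whether it fits this project's queue setup is an open question.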

Labels

enhancement: New feature or request
