feat: add index to speed up reindexFrom #1775

Open
andersju wants to merge 1 commit into develop from feature/add-index-for-faster-reindexfrom

andersju commented May 26, 2026

Currently, the reindexFrom step at the end of each reindex takes about 20 minutes, even when there's almost nothing to reindex, because the SQL queries behind it are very slow. So let's add an index to make them fast. tl;dr: roughly 100x speedup on a QA example query (about 262 s down to 2.6 s); the new index narrows the candidates to 520 rows instead of filtering all 20 million 'bib' rows.

For each collection we do a query like this:

SELECT id, data, created, modified, deleted
      FROM lddb
      WHERE GREATEST(modified, (data#>>'{@graph,0,generationDate}')::timestamptz) >= '2026-05-25 11:57:51+02'
        AND GREATEST(modified, (data#>>'{@graph,0,generationDate}')::timestamptz) <= 'infinity'
        AND collection = 'bib'
        AND deleted = false;

Previously on QA:

EXPLAIN ANALYZE SELECT id, data, created, modified, deleted
      FROM lddb
      WHERE GREATEST(modified, (data#>>'{@graph,0,generationDate}')::timestamptz) >= '2026-05-25 11:57:51+02'
        AND GREATEST(modified, (data#>>'{@graph,0,generationDate}')::timestamptz) <= 'infinity'
        AND collection = 'bib'
        AND deleted = false;

 Gather  (cost=221799.77..27499839.54 rows=47163 width=1229) (actual time=7681.396..262575.392 rows=244 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Bitmap Heap Scan on lddb  (cost=220799.77..27494123.24 rows=19651 width=1229) (actual time=7660.604..262056.823 rows=81 loops=3)
         Recheck Cond: (collection = 'bib'::text)
         Filter: ((NOT deleted) AND (GREATEST(modified, ((data #>> '{@graph,0,generationDate}'::text[]))::timestamp with time zone) >= '2026-05-25 11:57:51+02'::timestamp with time zone) AND (GREATEST(modified, ((data #>> '{@graph,0,generationDate}'::text[]))::timestamp with time zone) <= 'infinity'::timestamp with time zone))
         Rows Removed by Filter: 6686232
         Heap Blocks: exact=2832009
         ->  Bitmap Index Scan on idx_lddb_collection  (cost=0.00..220787.98 rows=19408455 width=0) (actual time=3103.190..3103.191 rows=20079895 loops=1)
               Index Cond: (collection = 'bib'::text)
 Planning Time: 0.162 ms
 Execution Time: 262576.044 ms

QA with the new index (note: it's no longer there, I added it temporarily for testing):

EXPLAIN ANALYZE SELECT id, data, created, modified, deleted
      FROM lddb
      WHERE GREATEST(modified, totstz(data#>>'{@graph,0,generationDate}')) >= '2026-05-25 11:57:51+02'
        AND GREATEST(modified, totstz(data#>>'{@graph,0,generationDate}')) <= 'infinity'
        AND collection = 'bib'
        AND deleted = false;

 Bitmap Heap Scan on lddb  (cost=241772.62..677457.22 rows=48754 width=1229) (actual time=2557.584..2599.301 rows=244 loops=1)
   Recheck Cond: ((GREATEST(modified, totstz((data #>> '{@graph,0,generationDate}'::text[]))) >= '2026-05-25 11:57:51+02'::timestamp with time zone) AND (GREATEST(modified, totstz((data #>> '{@graph,0,generationDate}'::text[]))) <= 'infinity'::timestamp with time zone) AND (collection = 'bib'::text))
   Filter: (NOT deleted)
   Heap Blocks: exact=154
   ->  BitmapAnd  (cost=241772.62..241772.62 rows=100318 width=0) (actual time=2556.991..2556.993 rows=0 loops=1)
         ->  Bitmap Index Scan on idx__lddb_greatest_modified  (cost=0.00..16047.11 rows=774254 width=0) (actual time=0.104..0.104 rows=520 loops=1)
               Index Cond: ((GREATEST(modified, totstz((data #>> '{@graph,0,generationDate}'::text[]))) >= '2026-05-25 11:57:51+02'::timestamp with time zone) AND (GREATEST(modified, totstz((data #>> '{@graph,0,generationDate}'::text[]))) <= 'infinity'::timestamp with time zone))
         ->  Bitmap Index Scan on idx_lddb_collection  (cost=0.00..225700.89 rows=20063509 width=0) (actual time=2531.675..2531.675 rows=20079895 loops=1)
               Index Cond: (collection = 'bib'::text)
 Planning Time: 1.323 ms
 Execution Time: 2599.357 ms

Note that I also changed from ::timestamptz to our totstz() function: index expressions may only use functions marked IMMUTABLE, and the text-to-timestamptz cast isn't (unlike totstz()), so it's not possible to create the index with ::timestamptz.
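
A minimal sketch of what the index definition could look like, reconstructed from the index name and expression shown in the plan above. The actual migration in this PR may differ; `CONCURRENTLY` and `IF NOT EXISTS` are assumptions, not taken from the PR.

```sql
-- Sketch only: the index name and expression come from the EXPLAIN output
-- above; CONCURRENTLY and IF NOT EXISTS are assumptions, not part of the PR.
-- Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx__lddb_greatest_modified
    ON lddb ((GREATEST(modified, totstz(data #>> '{@graph,0,generationDate}'))));

-- For comparison, the same expression with a plain cast is rejected,
-- because the text -> timestamptz cast depends on session settings
-- (TimeZone, DateStyle) and is therefore only STABLE:
--
--   CREATE INDEX ... ON lddb ((GREATEST(modified,
--       (data #>> '{@graph,0,generationDate}')::timestamptz)));
--   ERROR:  functions in index expression must be marked IMMUTABLE
```

The planner only considers an expression index when the query contains the same expression, which is why the query itself has to call totstz() too.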

jannistsiroyannis left a comment

I thought we already had a:
idx__lddb_greatest_modified
and it looks like we do:
"idx__lddb_greatest_modified" btree (GREATEST(modified, totstz(data #>> '{@graph,0,generationDate}'::text[])))

I may be mistaken. Need to look closer.

jannistsiroyannis commented May 26, 2026

I think I was wrong; you might have added it to dev in advance?

@jannistsiroyannis

Ah, my mistake: we had it, but only on the versions table.
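
One way to check which tables already carry such an index is to query the pg_indexes view. A sketch; the LIKE pattern is only an illustrative guess at the naming:

```sql
-- List every index whose name mentions "greatest_modified", along with
-- the table it belongs to and its full definition.
SELECT schemaname, tablename, indexname, indexdef
  FROM pg_indexes
 WHERE indexname LIKE '%greatest_modified%'
 ORDER BY tablename, indexname;
```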

jannistsiroyannis left a comment

The tradeoff (the size of the index) is perhaps worth the saved reindexing time, even if reindexing is only done occasionally. 🤷

andersju commented May 26, 2026

The new index is about 3.2 GB (out of 59 GB of indices just for the lddb table, and about 337 GB of indices in total). I think that's probably reasonable. An alternative is a UNION like this, which would use only pre-existing indices:

EXPLAIN ANALYZE  SELECT id, data, created, modified, deleted FROM lddb
  WHERE modified >= '2026-05-25 11:57:51+02' AND modified <= 'infinity' AND collection = 'bib' AND deleted = false
  UNION
  SELECT id, data, created, modified, deleted FROM lddb
  WHERE totstz(data#>>'{@graph,0,generationDate}') >= '2026-05-25 11:57:51+02' AND totstz(data#>>'{@graph,0,generationDate}') <= 'infinity' AND collection = 'bib' AND deleted = false;

 HashAggregate  (cost=503819.92..503915.55 rows=9563 width=81) (actual time=6.439..6.556 rows=244 loops=1)
   Group Key: lddb.id, lddb.data, lddb.created, lddb.modified, lddb.deleted
   Batches: 1  Memory Usage: 665kB
   ->  Append  (cost=0.57..503700.38 rows=9563 width=81) (actual time=0.025..1.632 rows=244 loops=1)
         ->  Index Scan using idx_lddb_modified on lddb  (cost=0.57..261042.49 rows=4791 width=1229) (actual time=0.025..0.474 rows=244 loops=1)
               Index Cond: ((modified >= '2026-05-25 11:57:51+02'::timestamp with time zone) AND (modified <= 'infinity'::timestamp with time zone))
               Filter: ((NOT deleted) AND (collection = 'bib'::text))
               Rows Removed by Filter: 276
         ->  Index Scan using idx_lddb_generation_date on lddb lddb_1  (cost=0.57..242514.44 rows=4772 width=1229) (actual time=1.134..1.134 rows=0 loops=1)
               Index Cond: ((totstz((data #>> '{@graph,0,generationDate}'::text[])) >= '2026-05-25 11:57:51+02'::timestamp with time zone) AND (totstz((data #>> '{@graph,0,generationDate}'::text[])) <= 'infinity'::timestamp with time zone))
               Filter: ((NOT deleted) AND (collection = 'bib'::text))
               Rows Removed by Filter: 416
 Planning Time: 0.264 ms
 Execution Time: 6.652 ms

But it'd require a little messing with the code, so a new index is still the cleaner solution, I'd say. Probably. 🤔
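
The size figures above can be reproduced with PostgreSQL's size functions. A sketch, assuming the index name from the plans earlier in this thread:

```sql
-- Size of the single new index (only works while the index exists).
SELECT pg_size_pretty(pg_relation_size('idx__lddb_greatest_modified')) AS new_index_size;

-- Combined size of all indexes attached to the lddb table.
SELECT pg_size_pretty(pg_indexes_size('lddb')) AS lddb_indexes_size;

-- Combined size of every index in the current database
-- (this includes indexes on system catalogs).
SELECT pg_size_pretty(sum(pg_relation_size(indexrelid))) AS all_indexes_size
  FROM pg_index;
```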
