4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -14,6 +14,10 @@ and this project adheres to

### Changed

- ✨(backend) improve indexing command
- checkpoint recovery
- asynchronicity
- admin command trigger
Comment on lines +17 to +20

⚠️ Potential issue | 🟡 Minor

Move this entry to [Unreleased].

This PR is still open and v4.8.3 is already dated 2026-03-23, so adding it under that release will make the changelog inaccurate for the next release.

📝 Proposed changelog placement
 ## [Unreleased]
 
 ### Changed
 
+- ✨(backend) improve indexing command
+  - checkpoint recovery
+  - asynchronicity
+  - admin command trigger
 - 💄(frontend) improve comments highlights `#1961`
 
 ## [v4.8.3] - 2026-03-23
 
 ### Changed
 
-- ✨(backend) improve indexing command
-  - checkpoint recovery 
-  - asynchronicity 
-  - admin command trigger
 - ♿️(frontend) improve version history list accessibility `#2033`

- ♿️(frontend) improve version history list accessibility #2033
- ♿(frontend) focus skip link on headings and skip grid dropzone #1983
- ♿️(frontend) add sr-only format to export download button #2088
2 changes: 1 addition & 1 deletion Makefile
@@ -260,7 +260,7 @@ demo: ## flush db then create a demo for load testing purpose
.PHONY: demo

index: ## index all documents to remote search
@$(MANAGE) index
@$(MANAGE) index $(args)
.PHONY: index

# Nota bene: Black should come after isort just in case they don't agree...
65 changes: 65 additions & 0 deletions docs/commands/index.md
@@ -0,0 +1,65 @@
# Index Command

The `index` management command is used to index documents to the remote search indexer.

It sends an asynchronous task to the Celery worker.

⚠️ Potential issue | 🟡 Minor

Misleading top-level description.

The command runs synchronously by default; it only dispatches to Celery when `--async` is passed. This sentence implies asynchronous execution is the default, which contradicts the argparse default (`async_mode=False`) in `src/backend/core/management/commands/index.py` line 54.

📝 Suggested fix
-It sends an asynchronous task to the Celery worker.
+By default, it indexes documents synchronously. Pass `--async` to dispatch the work to a Celery worker instead.

## Usage

### Make Command

```bash
# Basic usage with defaults
make index

# With custom parameters
make index args="--batch-size 100 --lower-time-bound 2024-01-01T00:00:00 --upper-time-bound 2026-01-01T00:00:00"

```

### Command line

```bash
python manage.py index \
--lower-time-bound "2024-01-01T00:00:00" \
--upper-time-bound "2024-01-31T23:59:59" \
--batch-size 200 \
--async_mode
```
Comment on lines +22 to +28

⚠️ Potential issue | 🟠 Major

Wrong flag name: use --async, not --async_mode.

The argparse argument name is `--async` (with `dest="async_mode"`). Users copy-pasting this example will hit `error: unrecognized arguments: --async_mode`.

📝 Suggested fix
 python manage.py index \
   --lower-time-bound "2024-01-01T00:00:00" \
   --upper-time-bound "2024-01-31T23:59:59" \
   --batch-size 200 \
-  --async_mode
+  --async
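The flag/dest mismatch is easy to reproduce with a stdlib-only sketch — the flag definition below mirrors the one quoted from `add_arguments`, and nothing else from the command is assumed:

```python
import argparse

# The flag is spelled "--async" on the command line, but argparse stores the
# parsed value under "async_mode" because of dest="async_mode".
parser = argparse.ArgumentParser()
parser.add_argument(
    "--async",
    action="store_true",
    dest="async_mode",
    default=False,
)

options = vars(parser.parse_args(["--async"]))
print(options["async_mode"])  # True

# "--async_mode" is not a registered option string, so argparse rejects it
# ("unrecognized arguments: --async_mode") and exits.
try:
    parser.parse_args(["--async_mode"])
except SystemExit:
    print("rejected")
```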

### Django Admin

The command is available in the Django admin interface:

1. Go to `/admin/core/run-indexing/`, you arrive at the "Run Indexing Command" page
3. Fill in the form with the desired parameters
4. Click **"Run Indexing Command"**
Comment on lines +34 to +36

⚠️ Potential issue | 🟡 Minor

Broken ordered list numbering and inconsistent button label.

  • Steps jump from 1. to 3. to 4. (missing 2) — also flagged by markdownlint (MD029).
  • Step 4 says click "Run Indexing Command", but the template submit button value is Run Indexing (see src/backend/core/templates/runindexing.html line 19).
📝 Suggested fix
-1. Go to `/admin/core/run-indexing/`, you arrive at the "Run Indexing Command" page
-3. Fill in the form with the desired parameters
-4. Click **"Run Indexing Command"**
+1. Go to `/admin/core/run-indexing/`, you arrive at the "Run Indexing Command" page
+2. Fill in the form with the desired parameters
+3. Click **"Run Indexing"**
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 35-35: Ordered list item prefix
Expected: 2; Actual: 3; Style: 1/2/3

(MD029, ol-prefix)


[warning] 36-36: Ordered list item prefix
Expected: 3; Actual: 4; Style: 1/2/3

(MD029, ol-prefix)


## Parameters

### `--batch-size`
- **type:** Integer
- **default:** `settings.SEARCH_INDEXER_BATCH_SIZE`
- **description:** Number of documents to process per batch. Higher values may improve performance but use more memory.

### `--lower-time-bound`
- **optional**: true
- **type:** ISO 8601 datetime string
- **default:** `None`
- **description:** Only documents updated after this date will be indexed.

### `--upper-time-bound`
- **optional**: true
- **type:** ISO 8601 datetime string
- **default:** `None`
- **description:** Only documents updated before this date will be indexed.

## `--async_mode`
- **type:** Boolean flag
- **default:** `False`
- **description:** Runs asynchronously is async_mode==True.
Comment on lines +57 to +60

⚠️ Potential issue | 🟡 Minor

Section heading uses wrong flag name and has a grammar/typo issue; also missing heading level.

Several problems on these lines:

  • Heading is ## --async_mode — the actual CLI flag is --async.
  • The heading is ## but surrounding parameter headings are ### (inconsistent hierarchy — this should be nested under "Parameters").
  • "Runs asynchronously is async_mode==True." is ungrammatical.
📝 Suggested fix
-## `--async_mode`
+### `--async`
 - **type:** Boolean flag
 - **default:** `False`
-- **description:** Runs asynchronously is async_mode==True. 
+- **description:** When set, dispatches the indexing job to a Celery worker instead of running it synchronously.
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 57-57: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


## Crash Safe Mode

The command saves the updated.at of the last document of each successful batch into the `bulk-indexer-checkpoint` cache variable.
If the process crashes, this value can be used as `lower-time-bound` to resume from the last successfully indexed document.
Comment on lines +64 to +65

⚠️ Potential issue | 🟡 Minor

Typo: `updated.at` → `updated_at`.

The field is updated_at (underscore), not updated.at.

📝 Suggested fix
-The command saves the updated.at of the last document of each successful batch into the `bulk-indexer-checkpoint` cache variable.
+The command saves the `updated_at` of the last document of each successful batch into the `bulk-indexer-checkpoint` cache variable.
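The checkpoint-recovery flow described in the doc can be sketched without Django — a plain dict stands in for the cache backend, the key name matches `bulk-indexer-checkpoint`, and the batch layout and document shape are illustrative assumptions rather than the command's actual implementation:

```python
from datetime import datetime, timedelta

cache = {}  # stand-in for Django's cache backend
CHECKPOINT_KEY = "bulk-indexer-checkpoint"

def index_in_batches(documents, batch_size, send_batch):
    """Index in batches, checkpointing updated_at after each successful batch."""
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        send_batch(batch)  # raises if the indexer crashes mid-run
        # Persist the updated_at of the last document of the successful batch.
        cache[CHECKPOINT_KEY] = batch[-1]["updated_at"]

# Simulate a crash on the second batch.
base = datetime(2024, 1, 1)
docs = [{"id": i, "updated_at": base + timedelta(hours=i)} for i in range(6)]

calls = []
def flaky_send(batch):
    if len(calls) == 1:
        raise RuntimeError("indexer went away")
    calls.append(batch)

try:
    index_in_batches(docs, batch_size=2, send_batch=flaky_send)
except RuntimeError:
    pass

# The checkpoint holds the updated_at of the last fully indexed document,
# ready to be reused as --lower-time-bound on the next run.
resume_from = cache[CHECKPOINT_KEY]
print(resume_from.isoformat())  # 2024-01-01T01:00:00
```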

81 changes: 80 additions & 1 deletion src/backend/core/admin.py
@@ -1,16 +1,55 @@
"""Admin classes and registrations for core app."""

from datetime import datetime

from django.contrib import admin, messages
from django.contrib.auth import admin as auth_admin
from django.shortcuts import redirect
from django.core.management import call_command
from django.http import HttpRequest
from django.shortcuts import redirect, render
from django.utils.translation import gettext_lazy as _

from treebeard.admin import TreeAdmin

from core import models
from core.forms import RunIndexingForm
from core.tasks.user_reconciliation import user_reconciliation_csv_import_job


# Customize the default admin site's get_app_list method
_original_get_app_list = admin.site.get_app_list


def custom_get_app_list(self, request, app_label=None):
"""Add custom commands to the app list."""
app_list = _original_get_app_list(request, app_label)

# Add Commands app with Run Indexing command
commands_app = {
"name": _("Commands"),
"app_label": "commands",
"app_url": "#",
"has_module_perms": True,
"models": [
{
"name": _("Run indexing"),
"object_name": "RunIndexing",
"admin_url": "/admin/core/run-indexing/",
"view_only": False,
"add_url": None,
"change_url": None,
}
],
}

app_list.append(commands_app)
return app_list


# Monkey-patch the admin site
admin.site.get_app_list = custom_get_app_list.__get__(admin.site, admin.AdminSite)


@admin.register(models.User)
class UserAdmin(auth_admin.UserAdmin):
"""Admin class for the User model"""
@@ -227,3 +266,43 @@ class InvitationAdmin(admin.ModelAdmin):
def save_model(self, request, obj, form, change):
obj.issuer = request.user
obj.save()


def run_indexing_view(request: HttpRequest):
"""Custom admin view for running indexing commands."""
if request.method == "POST":
form = RunIndexingForm(request.POST)
if form.is_valid():
call_command(
"index",
batch_size=int(request.POST.get("batch_size")),
lower_time_bound=convert_to_isoformat(
request.POST.get("lower_time_bound")
),
upper_time_bound=convert_to_isoformat(
request.POST.get("upper_time_bound")
),
async_mode=True,
)
Comment on lines +274 to +286

⚠️ Potential issue | 🟠 Major

Use form.cleaned_data instead of re-reading request.POST.

After form.is_valid(), the form has already validated and coerced batch_size to int and the time bounds to datetime objects. Going back to request.POST.get(...) and re-parsing via convert_to_isoformat duplicates work, bypasses the form's cross-field validation results, and reintroduces string parsing that could diverge from what the form accepted. Use the cleaned values directly.

♻️ Suggested fix
         if form.is_valid():
+            lower = form.cleaned_data.get("lower_time_bound")
+            upper = form.cleaned_data.get("upper_time_bound")
             call_command(
                 "index",
-                batch_size=int(request.POST.get("batch_size")),
-                lower_time_bound=convert_to_isoformat(
-                    request.POST.get("lower_time_bound")
-                ),
-                upper_time_bound=convert_to_isoformat(
-                    request.POST.get("upper_time_bound")
-                ),
+                batch_size=form.cleaned_data["batch_size"],
+                lower_time_bound=lower.isoformat() if lower else None,
+                upper_time_bound=upper.isoformat() if upper else None,
                 async_mode=True,
             )

With this, convert_to_isoformat can be removed entirely.

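The general point — trust the values the form already coerced instead of re-parsing the raw POST — can be shown with a toy validator; no Django here, and the field set only loosely mirrors `RunIndexingForm`:

```python
from datetime import datetime

def clean(post_data):
    """Toy stand-in for Form.is_valid(): coerce raw strings once, up front."""
    cleaned = {"batch_size": int(post_data["batch_size"])}
    for key in ("lower_time_bound", "upper_time_bound"):
        raw = post_data.get(key)
        cleaned[key] = datetime.fromisoformat(raw) if raw else None
    return cleaned

post = {"batch_size": "100", "lower_time_bound": "2024-01-01T00:00", "upper_time_bound": ""}
cleaned = clean(post)

# Downstream code consumes the typed values; there is no second parse of the
# raw POST, so nothing can diverge from what validation accepted.
kwargs = {
    "batch_size": cleaned["batch_size"],
    "lower_time_bound": cleaned["lower_time_bound"].isoformat()
    if cleaned["lower_time_bound"] else None,
    "upper_time_bound": cleaned["upper_time_bound"].isoformat()
    if cleaned["upper_time_bound"] else None,
}
print(kwargs)
```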

messages.success(request, _("Indexing triggered!"))
else:
messages.error(request, _("Please correct the errors below."))
else:
form = RunIndexingForm()
Comment on lines +271 to +291

⚠️ Potential issue | 🔴 Critical

Critical: the view has no authentication or permission check.

run_indexing_view is wired at /admin/core/run-indexing/ but lacks any @staff_member_required, @login_required, or permission gate. Any unauthenticated visitor who knows (or guesses) the URL and obtains a CSRF token (trivially, from a GET of the same page) can POST and trigger a full-corpus reindex via Celery. This is a privilege-escalation/DoS vector that must be fixed before merging.

🔒 Suggested fix
+from django.contrib.admin.views.decorators import staff_member_required
+
+@staff_member_required
 def run_indexing_view(request: HttpRequest):
     """Custom admin view for running indexing commands."""

Consider also requiring a more specific permission (e.g. is_superuser or a custom core.run_indexing perm) given the blast radius of the action.


Comment on lines +287 to +291

⚠️ Potential issue | 🟡 Minor

POST path should redirect (PRG) to avoid resubmission on refresh.

Rendering the template directly after a successful POST means a browser refresh will re-dispatch the Celery task. Use redirect(...) after messages.success (and after the error branch) to follow the Post/Redirect/Get pattern.

♻️ Suggested fix
         if form.is_valid():
             call_command(...)
             messages.success(request, _("Indexing triggered!"))
+            return redirect("run_indexing")
         else:
             messages.error(request, _("Please correct the errors below."))
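The PRG shape can be sketched framework-free — tuples stand in for Django responses, the URL is a placeholder, and form handling is reduced to a validity flag:

```python
def run_indexing_view(method, form_is_valid, dispatch_task,
                      url="/admin/core/run-indexing/"):
    """Handle the form with Post/Redirect/Get semantics."""
    if method == "POST":
        if form_is_valid:
            dispatch_task()
        # Redirect on both POST branches: a refresh after the POST then
        # re-issues a harmless GET instead of resubmitting the form.
        return ("redirect", url)
    return ("render", "runindexing.html")

dispatched = []
first = run_indexing_view("POST", True, lambda: dispatched.append("index"))
# The browser follows the redirect with a GET; refreshing repeats the GET only.
refresh = run_indexing_view("GET", True, lambda: dispatched.append("index"))

print(first)       # ('redirect', '/admin/core/run-indexing/')
print(refresh)     # ('render', 'runindexing.html')
print(dispatched)  # ['index'] -- the task ran exactly once
```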


return render(
request=request,
template_name="runindexing.html",
context={
**admin.site.each_context(request),
"title": "Run Indexing Command",
"form": form,
},
)


def convert_to_isoformat(value: str) -> str | None:
"""Convert datetime-local input to ISO format."""
if value:
return datetime.fromisoformat(value).isoformat()
return None
15 changes: 15 additions & 0 deletions src/backend/core/factories.py
@@ -6,6 +6,7 @@
from django.contrib.auth.hashers import make_password

import factory.fuzzy
from factory import post_generation
from faker import Faker

from core import models
@@ -159,6 +160,20 @@ def masked_by(self, create, extracted, **kwargs):
document=self, user=item, defaults={"is_masked": True}
)

@post_generation
def updated_at(self, create, extracted, **kwargs):
"""
the BaseModel.updated_at has auto_now=True.
This prevents setting a specific updated_at value with the factory.

This post_generation method bypasses this behavior.
"""
if not create or not extracted:
return

self.__class__.objects.filter(pk=self.pk).update(updated_at=extracted)
self.refresh_from_db()


class UserDocumentAccessFactory(factory.django.DjangoModelFactory):
"""Create fake document user accesses for testing."""
40 changes: 40 additions & 0 deletions src/backend/core/forms.py
@@ -0,0 +1,40 @@
"""Forms for the core app."""

from django import forms
from django.conf import settings


class RunIndexingForm(forms.Form):
"""
Form for running the indexing process.
"""

batch_size = forms.IntegerField(
min_value=1,
initial=settings.SEARCH_INDEXER_BATCH_SIZE,
)
lower_time_bound = forms.DateTimeField(
required=False, widget=forms.TextInput(attrs={"type": "datetime-local"})
)
upper_time_bound = forms.DateTimeField(
required=False, widget=forms.TextInput(attrs={"type": "datetime-local"})
)

def clean(self):
"""Override clean to validate time bounds."""
cleaned_data = super().clean()
self.check_time_bounds()
return cleaned_data

def check_time_bounds(self):
"""Validate that lower_time_bound is before upper_time_bound."""
lower_time_bound = self.cleaned_data.get("lower_time_bound")
upper_time_bound = self.cleaned_data.get("upper_time_bound")
if (
lower_time_bound
and upper_time_bound
and lower_time_bound > upper_time_bound
):
raise forms.ValidationError(
"Lower time bound must be before upper time bound."
)
Comment on lines +33 to +40

🧹 Nitpick | 🔵 Trivial

Attach the cross-field error to a specific field for better admin UX.

Raising from clean() puts the error under __all__ (non-field errors), which renders separately from the field. Using self.add_error("upper_time_bound", ...) surfaces it next to the offending input.

♻️ Suggested refactor
         if (
             lower_time_bound
             and upper_time_bound
             and lower_time_bound > upper_time_bound
         ):
-            raise forms.ValidationError(
-                "Lower time bound must be before upper time bound."
-            )
+            self.add_error(
+                "upper_time_bound",
+                _("Upper time bound must be after lower time bound."),
+            )

Also consider wrapping the message in gettext_lazy (_) for i18n consistency with the rest of the admin.


76 changes: 61 additions & 15 deletions src/backend/core/management/commands/index.py
@@ -4,12 +4,16 @@

import logging
import time
from datetime import datetime

from django.conf import settings
from django.core.management.base import BaseCommand, CommandError

from core import models
from core.services.search_indexers import get_document_indexer
from core.tasks.search import batch_document_indexer_task

logger = logging.getLogger("docs.search.bootstrap_search")
logger = logging.getLogger(__name__)


class Command(BaseCommand):
@@ -24,9 +28,32 @@ def add_arguments(self, parser):
action="store",
dest="batch_size",
type=int,
default=50,
default=settings.SEARCH_INDEXER_BATCH_SIZE,
help="Indexation query batch size",
)
parser.add_argument(
"--lower-time-bound",
action="store",
dest="lower_time_bound",
type=datetime.fromisoformat,
default=None,
help="DateTime in ISO format. Only documents updated after this date will be indexed",
)
parser.add_argument(
"--upper-time-bound",
action="store",
dest="upper_time_bound",
type=datetime.fromisoformat,
default=None,
help="DateTime in ISO format. Only documents updated before this date will be indexed",
)
parser.add_argument(
"--async",
action="store_true",
dest="async_mode",
default=False,
help="Whether to execute indexing asynchronously in a Celery task (default: False)",
)

def handle(self, *args, **options):
"""Launch and log search index generation."""
@@ -35,18 +62,37 @@ def handle(self, *args, **options):
if not indexer:
raise CommandError("The indexer is not enabled or properly configured.")

logger.info("Starting to regenerate Find index...")
start = time.perf_counter()
batch_size = options["batch_size"]
if options["async_mode"]:
batch_document_indexer_task.apply_async(
kwargs={
"lower_time_bound": options["lower_time_bound"],
"upper_time_bound": options["upper_time_bound"],
"batch_size": options["batch_size"],
"crash_safe_mode": True,
},
)
logger.info(
"Document indexing task sent to worker",
)
Comment on lines +65 to +76

⚠️ Potential issue | 🟠 Major

No error handling around apply_async dispatch.

If the broker is unreachable or the task fails to enqueue (e.g., OperationalError from kombu), the exception will bubble up with an ugly traceback instead of a clean CommandError. The sync branch already wraps indexer.index(...) in a try/except; please mirror that here so the command fails gracefully.

🛠️ Suggested fix
         if options["async_mode"]:
-            batch_document_indexer_task.apply_async(
-                kwargs={
-                    "lower_time_bound": options["lower_time_bound"],
-                    "upper_time_bound": options["upper_time_bound"],
-                    "batch_size": options["batch_size"],
-                    "crash_safe_mode": True,
-                },
-            )
-            logger.info(
-                "Document indexing task sent to worker",
-            )
+            try:
+                batch_document_indexer_task.apply_async(
+                    kwargs={
+                        "lower_time_bound": options["lower_time_bound"],
+                        "upper_time_bound": options["upper_time_bound"],
+                        "batch_size": options["batch_size"],
+                        "crash_safe_mode": True,
+                    },
+                )
+            except Exception as err:
+                raise CommandError("Unable to dispatch indexing task") from err
+            logger.info("Document indexing task sent to worker")

else:
logger.info("Starting to regenerate Find index...")
start = time.perf_counter()

try:
count = indexer.index(batch_size=batch_size)
except Exception as err:
raise CommandError("Unable to regenerate index") from err
try:
count = indexer.index(
queryset=models.Document.objects.filter_updated_at(
lower_time_bound=options["lower_time_bound"],
upper_time_bound=options["upper_time_bound"],
),
batch_size=options["batch_size"],
crash_safe_mode=True,
)
except Exception as err:
raise CommandError("Unable to regenerate index") from err

duration = time.perf_counter() - start
logger.info(
"Search index regenerated from %d document(s) in %.2f seconds.",
count,
duration,
)
duration = time.perf_counter() - start
logger.info(
"Search index regenerated from %d document(s) in %.2f seconds.",
count,
duration,
)
32 changes: 32 additions & 0 deletions src/backend/core/models.py
@@ -859,6 +859,38 @@ def annotate_user_roles(self, user):
user_roles=models.Value([], output_field=output_field),
)

def filter_updated_at(self, lower_time_bound=None, upper_time_bound=None):
mascarpon3 marked this conversation as resolved.
"""
Filter documents by update_at.

Args:
lower_time_bound (datetime, optional):
Keep documents updated after this timestamp.
upper_time_bound (datetime, optional):
Keep documents updated before this timestamp.

Returns:
QuerySet: Filtered queryset ready for indexation.
"""
conditions = models.Q()
if lower_time_bound and upper_time_bound:
conditions = models.Q(
updated_at__gte=lower_time_bound, updated_at__lte=upper_time_bound
) | models.Q(
ancestors_deleted_at__gte=lower_time_bound,
ancestors_deleted_at__lte=upper_time_bound,
)
elif lower_time_bound:
conditions = models.Q(updated_at__gte=lower_time_bound) | models.Q(
ancestors_deleted_at__gte=lower_time_bound
)
elif upper_time_bound:
conditions = models.Q(updated_at__lte=upper_time_bound) | models.Q(
ancestors_deleted_at__lte=upper_time_bound
)

return self.filter(conditions)
Comment on lines +875 to +892

⚠️ Potential issue | 🟠 Major

Restored documents can be missed by bounded re-indexing.

This filter only sees the current updated_at/ancestors_deleted_at values. On restore, ancestors_deleted_at is cleared or moved to an older ancestor timestamp, and restore() saves only deleted_at/ancestors_deleted_at, so the restored document can stay inactive in the search index after a time-bounded run.

🛠️ Suggested direction

Either bump updated_at when restoring documents/descendants, or introduce a dedicated indexation timestamp that is updated on soft-delete and restore transitions.

-        self.save(update_fields=["deleted_at", "ancestors_deleted_at"])
+        self.updated_at = timezone.now()
+        self.save(update_fields=["deleted_at", "ancestors_deleted_at", "updated_at"])

Also ensure descendant restore updates have a timestamp that filter_updated_at() can select.

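The Q-object combination in `filter_updated_at()` reduces to "keep a document when either timestamp falls inside the window". A pure-Python rendering of that predicate (dicts stand in for ORM rows; field names match the diff):

```python
from datetime import datetime

def in_window(value, lower=None, upper=None):
    """True when value exists and lies inside the [lower, upper] bounds."""
    if value is None:
        return False
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True

def filter_updated_at(documents, lower_time_bound=None, upper_time_bound=None):
    """Mirror of the queryset filter: match on updated_at OR ancestors_deleted_at."""
    if lower_time_bound is None and upper_time_bound is None:
        return list(documents)  # an empty Q() matches every row
    return [
        doc
        for doc in documents
        if in_window(doc["updated_at"], lower_time_bound, upper_time_bound)
        or in_window(doc["ancestors_deleted_at"], lower_time_bound, upper_time_bound)
    ]

docs = [
    {"id": 1, "updated_at": datetime(2024, 1, 10), "ancestors_deleted_at": None},
    {"id": 2, "updated_at": datetime(2023, 12, 1), "ancestors_deleted_at": datetime(2024, 1, 5)},
    {"id": 3, "updated_at": datetime(2023, 12, 1), "ancestors_deleted_at": None},
]
selected = filter_updated_at(docs, lower_time_bound=datetime(2024, 1, 1))
print([d["id"] for d in selected])  # [1, 2]
```

Document 3 here behaves like the restored document the comment describes: neither of its timestamps was bumped, so a time-bounded run never selects it.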



class DocumentManager(MP_NodeManager.from_queryset(DocumentQuerySet)):
"""