RDoc-3843_taskErrors - document new task errors views and functionality #2450

Open
reebhub wants to merge 7 commits into ravendb:main from reebhub:RDoc-3843_taskErrors

reebhub commented May 19, 2026

Issue link

RDoc-3843
RDoc-3844
RDoc-3845
RDoc-3849
RDoc-3775
RDoc-3854
RDoc-3861
RDoc-3851
RDoc-3811

Type of change

  • Content - docs
  • Content - cloud
  • Content - guides
  • Content - start pages/other
  • New docs feature (consider updating /templates or readme)
  • Bug fix
  • Optimization
  • Other

Changes in docs URLs

  • No changes in docs URLs
  • Articles are restructured, URLs will change, mapping is required (update /scripts/redirects.json file, set Documents Moved PR label)

Changes in UX/UI

  • No changes in UX/UI
  • Changes in UX/UI (include screenshots and description)

reebhub requested a review from Lwiel May 19, 2026 01:54
Comment on lines +113 to +115
* **Persistence** (AI tasks only)
The task could not save its results back to the database. Typical causes include write
conflicts or storage errors.

Contributor

It also occurs when we fail to update process state, so it's not AI tasks only

Contributor Author

done

The retention is per task and per table, so a single noisy task cannot push errors out of an
unrelated task. The cap is not configurable.

Errors are also included in the server's debug package as `etl.errors.json`, so support

Contributor

AI task errors are stored separately

Contributor Author

done


A task recovers automatically as new batches complete. The health state transitions from
`Failed` back to `Impaired`, and from `Impaired` back to `Healthy`, as the running error rate
falls below each threshold. There is no manual "reset" action.

Contributor

Maybe it's worth noting we reset health state back to Healthy on task configuration update

Contributor Author

done
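
The recovery behavior discussed in this thread can be sketched as a pure function of the running error rate. This is only an illustration, not RavenDB's implementation: the threshold values and the names `IMPAIRED_THRESHOLD`, `FAILED_THRESHOLD`, and `health_state` are hypothetical, because the hunk does not state the actual thresholds.

```python
# Hypothetical thresholds; the actual values are not given in this thread.
IMPAIRED_THRESHOLD = 0.10
FAILED_THRESHOLD = 0.50


def health_state(error_rate: float) -> str:
    """Map a running error rate (0.0 to 1.0) to a health state name."""
    if error_rate >= FAILED_THRESHOLD:
        return "Failed"
    if error_rate >= IMPAIRED_THRESHOLD:
        return "Impaired"
    return "Healthy"


# As the rate falls below each threshold, the state steps back down
# without any manual reset.
for rate in (0.60, 0.30, 0.05):
    print(f"{rate:.2f} -> {health_state(rate)}")
```

Per the review comment above, updating the task configuration also resets the state to `Healthy`; that path is not modeled in this sketch.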

Comment on lines +222 to +229
`GET /databases/*/tasks/errors` returns errors across all ETL and AI tasks.
`GET /databases/*/etl/errors` and `GET /databases/*/ai/errors` return errors per category.
`DELETE` variants of each path remove errors in bulk, optionally filtered by task name or
category. For example, `DELETE /databases/*/etl/errors?name=<task-name>` clears the errors
of one specific ETL task.
`POST /databases/*/etl/retry-batch` forces an immediate retry of an ETL task currently in
fallback mode.
See [Debug Endpoints](../../server/troubleshooting/debug-routes.mdx#debug-endpoints) for the full reference.

Contributor

Can we make it a list?

Contributor Author

done
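
For readers who want to try the endpoints listed in the hunk above, here is a minimal sketch that builds the URLs and prepares a `DELETE` request without sending it. The server URL, database name, task name, and helper names (`errors_url`, `retry_batch_url`) are hypothetical. Only the paths and the `name` query parameter come from the hunk; accepting `name` on paths other than `/etl/errors` is an assumption, as is the absence of parameters on `retry-batch`. Authentication is not covered.

```python
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import Request

# Hypothetical values for illustration; substitute your own server and database.
SERVER_URL = "http://localhost:8080"
DATABASE = "Northwind"

CATEGORIES = ("tasks", "etl", "ai")  # path segments taken from the hunk above


def errors_url(category: str, task_name: Optional[str] = None) -> str:
    """Build the errors URL for one category, optionally filtered by task name."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    url = f"{SERVER_URL}/databases/{quote(DATABASE, safe='')}/{category}/errors"
    if task_name is not None:
        # The `name` parameter is shown in the hunk for DELETE /etl/errors;
        # using it on the other paths is an assumption.
        url += "?" + urlencode({"name": task_name})
    return url


def retry_batch_url() -> str:
    """Build the URL that forces a retry of an ETL task in fallback mode."""
    return f"{SERVER_URL}/databases/{quote(DATABASE, safe='')}/etl/retry-batch"


# Prepare (but do not send) a request that clears the errors of one ETL task.
request = Request(errors_url("etl", "orders-etl"), method="DELETE")
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would send it; the same URL with the default `GET` method reads the errors instead of deleting them.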


<ContentFrame>

### Task health indicators

Contributor

Maybe let's mention that only the node the task is currently on, and nodes that contain any errors, are displayed here

Contributor Author

done

* Retention is per task and per table, so a single noisy task cannot push errors out of
an unrelated task.

* Errors are also included in the server's debug package as `etl.errors.json`, so

Contributor

There's a separate JSON file with AI task errors

Contributor Author

done

* **Item error**
An error that occurred while processing a single document. The document was skipped and the
task moved on to the remaining documents in the batch. The error record includes the
document ID.

Member

It's worth adding that an item error on a given document means the doc is skipped and the process continues to move forward.

* **Process error**
An error that occurred while processing a batch as a whole and may affect multiple documents,
such as a failure to send the batch to its destination. The error record includes the number
of documents the failing batch attempted to handle.

Member

A process error, on the other hand, means the process will keep retrying the batch until it succeeds (with a fallback strategy)
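
The difference between the two error kinds described in the hunks and comments above can be sketched as a small batch loop. This is an illustration under stated assumptions, not RavenDB's code: `run_batch`, `transform`, `send`, and `max_attempts` are hypothetical names, and the retry count is bounded here only so the example terminates, whereas the comment above describes retrying until the batch succeeds.

```python
from typing import Callable, List, Sequence, Tuple


def run_batch(
    doc_ids: Sequence[str],
    transform: Callable[[str], str],
    send: Callable[[List[str]], None],
    max_attempts: int = 3,  # hypothetical bound so the sketch terminates
) -> Tuple[List[Tuple[str, str]], List[Tuple[int, str]]]:
    """Process one batch and return (item_errors, process_errors)."""
    item_errors: List[Tuple[str, str]] = []     # (document ID, message)
    process_errors: List[Tuple[int, str]] = []  # (documents in batch, message)
    transformed: List[str] = []

    for doc_id in doc_ids:
        try:
            transformed.append(transform(doc_id))
        except Exception as exc:
            # Item error: record the document ID, skip the document, keep going.
            item_errors.append((doc_id, str(exc)))

    for _ in range(max_attempts):
        try:
            send(transformed)
            break
        except Exception as exc:
            # Process error: the batch as a whole failed, so it is retried.
            process_errors.append((len(doc_ids), str(exc)))

    return item_errors, process_errors
```

With a `transform` that fails on one document and a `send` that fails once before succeeding, the function returns one item error carrying the document ID and one process error carrying the batch size.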


* **Persistence**
The task could not save its results back to the database, or could not update its own
process state. Typical causes include write conflicts or storage errors.

Member

> Typical causes include write conflicts or storage errors.

Write conflicts? @Lwiel please take a look


* Each task keeps two dedicated tables on disk: one for item errors and one for process
errors.
ETL and AI task errors are kept in separate storage and don't share these tables.

Member

More precisely, each ETL or AI task keeps its errors in separate tables

* Retention is per task and per table, so a single noisy task cannot push errors out of
an unrelated task.

* Errors are also included in the server's debug package as `etl.errors.json` (for

Member

Is it worth mentioning? It's very detailed info about Debug Package
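
The per-task, per-table retention described in the hunks above can be illustrated with one bounded queue per `(task, table)` pair. This is a sketch under assumptions: the cap value, the class name `ErrorStore`, and the drop-oldest eviction policy are hypothetical, since the thread only says that a cap exists and that it applies per task and per table.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

MAX_ERRORS_PER_TABLE = 3  # hypothetical cap, kept small for demonstration


class ErrorStore:
    """Keep a separate bounded error list for every (task, table) pair."""

    def __init__(self, cap: int = MAX_ERRORS_PER_TABLE) -> None:
        self._cap = cap
        self._tables: Dict[Tuple[str, str], Deque[str]] = {}

    def record(self, task: str, table: str, message: str) -> None:
        key = (task, table)
        if key not in self._tables:
            # deque(maxlen=...) silently drops the oldest entry when full.
            self._tables[key] = deque(maxlen=self._cap)
        self._tables[key].append(message)

    def errors(self, task: str, table: str) -> List[str]:
        return list(self._tables.get((task, table), ()))


store = ErrorStore()
store.record("quiet-task", "item", "one failure")
for i in range(10):
    store.record("noisy-task", "item", f"failure {i}")

# The noisy task only evicts its own entries.
print(store.errors("quiet-task", "item"))
print(store.errors("noisy-task", "item"))
```

Because each queue is keyed by both task and table, filling the item-error table of one task never touches its process-error table or any table of another task.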


RavenDB watches the ratio between a task's failed items and the total number of items the
task has attempted to process. The ratio is computed as an EWMA (Exponentially Weighted
Moving Average) and is updated continuously as new batches complete.

Member

Is it worth adding that it's a time-agnostic EWMA? @Lwiel
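
For context on the EWMA mentioned in the hunk above, here is a minimal sketch of a time-agnostic variant, meaning each completed batch contributes one sample regardless of how much wall-clock time has passed. The smoothing factor `alpha`, the seeding with the first sample, and the class name `ErrorRateEwma` are assumptions; the thread does not specify how RavenDB weights or seeds the average.

```python
from typing import Optional


class ErrorRateEwma:
    """Exponentially weighted moving average of a per-batch failure ratio."""

    def __init__(self, alpha: float = 0.2) -> None:  # alpha is hypothetical
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value: Optional[float] = None

    def update(self, failed: int, total: int) -> float:
        """Fold one completed batch into the running average."""
        if total <= 0:
            # An empty batch carries no information; keep the current value.
            return self.value if self.value is not None else 0.0
        sample = failed / total
        if self.value is None:
            self.value = sample  # seed with the first observed ratio
        else:
            self.value = self.alpha * sample + (1.0 - self.alpha) * self.value
        return self.value


ewma = ErrorRateEwma(alpha=0.5)
for failed, total in [(5, 10), (0, 10), (0, 10)]:
    print(ewma.update(failed, total))
```

Clean batches pull the average down geometrically, which matches the hunk's description of a ratio that is updated continuously as new batches complete.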

in the [HTTP endpoints](../../server/troubleshooting/debug-routes.mdx#debug-endpoints),
in the [SNMP OIDs](../../server/administration/snmp/snmp-overview.mdx#list-of-oids),
in the [Prometheus metrics](../../server/administration/monitoring/prometheus.mdx#metrics-provided-by-the-prometheus-endpoint),
and in the [JSON monitoring endpoints](../../server/administration/monitoring/telegraf.mdx#monitoring-endpoints).

Member

> JSON monitoring endpoints

Is that what we officially call this feature? I thought it's Monitoring endpoints (https://docs.ravendb.net/7.2/server/administration/monitoring/telegraf#monitoring-endpoints)

3 participants