RDoc-3843_taskErrors - document new task errors views and functionality #2450
reebhub wants to merge 7 commits
…nerating embeddings)
| * **Persistence** (AI tasks only)
| The task could not save its results back to the database. Typical causes include write
| conflicts or storage errors.
It also occurs when we fail to update process state, so it's not AI tasks only
| The retention is per task and per table, so a single noisy task cannot push errors out of an
| unrelated task. The cap is not configurable.
|
| Errors are also included in the server's debug package as `etl.errors.json`, so support
AI task errors are stored separately
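To illustrate the per-task, per-table retention that the hunk above describes, here is a minimal sketch. It is not RavenDB's implementation: the `ErrorStore` class, the table names, and the cap value of 3 are all hypothetical, chosen only to show why a noisy task cannot evict another task's errors.

```python
from collections import defaultdict, deque

# Hypothetical cap; the docs only say that a cap exists and is not configurable.
MAX_ERRORS_PER_TABLE = 3


class ErrorStore:
    """Keeps a bounded list of errors for each (task, table) pair."""

    def __init__(self, cap=MAX_ERRORS_PER_TABLE):
        # One bounded queue per (task name, table name) pair.
        self._tables = defaultdict(lambda: deque(maxlen=cap))

    def record(self, task, table, message):
        # When a queue is full, only the oldest entry of that same queue is dropped.
        self._tables[(task, table)].append(message)

    def errors(self, task, table):
        return list(self._tables.get((task, table), ()))


store = ErrorStore()
store.record("quiet-task", "item", "doc users/1 failed")
for i in range(10):
    store.record("noisy-task", "item", f"failure {i}")

print(store.errors("quiet-task", "item"))
print(store.errors("noisy-task", "item"))
```

Because each pair has its own queue, the ten failures of `noisy-task` only push out older `noisy-task` entries, and the single `quiet-task` error survives.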
| A task recovers automatically as new batches complete. The health state transitions from
| `Failed` back to `Impaired`, and from `Impaired` back to `Healthy`, as the running error rate
| falls below each threshold. There is no manual "reset" action.
Maybe it's worth noting that we reset the health state back to `Healthy` on a task configuration update.
| `GET /databases/*/tasks/errors` returns errors across all ETL and AI tasks.
| `GET /databases/*/etl/errors` and `GET /databases/*/ai/errors` return errors per category.
| `DELETE` variants of each path remove errors in bulk, optionally filtered by task name or
| category. For example, `DELETE /databases/*/etl/errors?name=<task-name>` clears the errors
| of one specific ETL task.
| `POST /databases/*/etl/retry-batch` forces an immediate retry of an ETL task currently in
| fallback mode.
| See [Debug Endpoints](../../server/troubleshooting/debug-routes.mdx#debug-endpoints) for the full reference.
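As a reading aid for the paths listed above, the following sketch builds the error-endpoint URLs. It assumes that `*` in each path stands for the database name; the helper name `task_errors_url`, the server address, and the database and task names are hypothetical. It only constructs URLs and sends nothing, so the HTTP method (`GET` or `DELETE`) is left to the caller.

```python
from urllib.parse import quote, urlencode


def task_errors_url(server, database, category=None, name=None):
    """Build a task-errors URL.

    category: None for all tasks, or "etl" / "ai" for one category.
    name: optional task name, added as the `name` query parameter.
    """
    segment = "tasks" if category is None else category
    url = f"{server.rstrip('/')}/databases/{quote(database, safe='')}/{segment}/errors"
    if name is not None:
        url += "?" + urlencode({"name": name})
    return url


# Hypothetical server and database names.
print(task_errors_url("http://localhost:8080", "Northwind"))
print(task_errors_url("http://localhost:8080", "Northwind", category="etl", name="my task"))
```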
|
| <ContentFrame>
|
| ### Task health indicators
Maybe let's mention that only the node the task is currently on, and nodes that contain any errors, are displayed here.
…ing-tasks/general-info) to use it as a reference from task errors pages
| * Retention is per task and per table, so a single noisy task cannot push errors out of
| an unrelated task.
|
| * Errors are also included in the server's debug package as `etl.errors.json`, so
There's a separate JSON file with AI task errors.
| * **Item error**
| An error that occurred while processing a single document. The document was skipped and the
| task moved on to the remaining documents in the batch. The error record includes the
| document ID.
It's worth adding that an item error on a given document means the document is skipped and the process continues to move forward.
| * **Process error**
| An error that occurred while processing a batch as a whole and may affect multiple documents,
| such as a failure to send the batch to its destination. The error record includes the number
| of documents the failing batch attempted to handle.
A process error, on the other hand, means the process will keep retrying the batch until it succeeds (with a fallback strategy).
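The difference between the two error kinds can be sketched as follows. This is an illustration under assumptions, not the server's code: `run_batch`, the record field names, and the doubling delay used as the "fallback strategy" are all hypothetical. An item error skips one document; a process error keeps the whole batch retrying until the send succeeds.

```python
import time


def run_batch(docs, transform, send, max_delay=1.0, sleep=time.sleep):
    """Process one batch of (document_id, document) pairs."""
    item_errors, process_errors, transformed = [], [], []

    for doc_id, doc in docs:
        try:
            transformed.append(transform(doc))
        except Exception as e:
            # Item error: record the document ID, skip the document, move on.
            item_errors.append({"document_id": doc_id, "error": str(e)})

    delay = 0.01  # hypothetical initial fallback delay, in seconds
    while True:
        try:
            send(transformed)
            break
        except Exception as e:
            # Process error: record the batch size, then retry the same batch.
            process_errors.append({"documents_in_batch": len(docs), "error": str(e)})
            sleep(delay)
            delay = min(delay * 2, max_delay)

    return transformed, item_errors, process_errors


attempts = {"n": 0}


def flaky_send(batch):
    # Fails twice, then succeeds, to simulate an unavailable destination.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("destination unavailable")


def to_upper_name(doc):
    if "name" not in doc:
        raise ValueError("missing name")
    return doc["name"].upper()


docs = [("users/1", {"name": "a"}), ("users/2", {}), ("users/3", {"name": "c"})]
sent, item_errors, process_errors = run_batch(
    docs, to_upper_name, flaky_send, sleep=lambda _: None
)
print(sent, item_errors, process_errors)
```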
| * **Persistence**
| The task could not save its results back to the database, or could not update its own
| process state. Typical causes include write conflicts or storage errors.
> Typical causes include write conflicts or storage errors.

Write conflicts? @Lwiel, please take a look.
| * Each task keeps two dedicated tables on disk: one for item errors and one for process
| errors.
| ETL and AI task errors are kept in separate storage and don't share these tables.
More precisely, each ETL or AI task keeps its errors in separate tables.
| * Retention is per task and per table, so a single noisy task cannot push errors out of
| an unrelated task.
|
| * Errors are also included in the server's debug package as `etl.errors.json` (for
Is it worth mentioning? It's very detailed info about Debug Package
| RavenDB watches the ratio between a task's failed items and the total number of items the
| task has attempted to process. The ratio is computed as an EWMA (Exponentially Weighted
| Moving Average) and is updated continuously as new batches complete.
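For readers unfamiliar with EWMA, the health tracking described above could look roughly like the sketch below. The smoothing factor `alpha`, both threshold values, and the `TaskHealth` class are assumptions made for illustration; the docs do not state the real values, and smoothing a per-batch failure ratio is only one possible reading of the description. The `reset` method reflects the review note that a task configuration update returns the state to `Healthy`.

```python
class TaskHealth:
    """Tracks a smoothed error rate and maps it to a health state."""

    def __init__(self, alpha=0.5, impaired_at=0.2, failed_at=0.6):
        # All three defaults are hypothetical.
        self.alpha = alpha
        self.impaired_at = impaired_at
        self.failed_at = failed_at
        self.error_rate = 0.0

    def record_batch(self, failed, attempted):
        """Fold one completed batch into the running error rate."""
        if attempted > 0:
            ratio = failed / attempted
            # EWMA update: new = alpha * sample + (1 - alpha) * old
            self.error_rate = self.alpha * ratio + (1 - self.alpha) * self.error_rate
        return self.state

    @property
    def state(self):
        if self.error_rate >= self.failed_at:
            return "Failed"
        if self.error_rate >= self.impaired_at:
            return "Impaired"
        return "Healthy"

    def reset(self):
        # Called on a task configuration update in this sketch.
        self.error_rate = 0.0


health = TaskHealth()
states = [
    health.record_batch(10, 10),  # rate 0.5
    health.record_batch(10, 10),  # rate 0.75
    health.record_batch(0, 10),   # rate 0.375
    health.record_batch(0, 10),   # rate 0.1875
]
print(states)
```

With these assumed values the task degrades to `Failed` after two fully failing batches and recovers through `Impaired` to `Healthy` after two clean ones, with no manual reset involved.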
| in the [HTTP endpoints](../../server/troubleshooting/debug-routes.mdx#debug-endpoints),
| in the [SNMP OIDs](../../server/administration/snmp/snmp-overview.mdx#list-of-oids),
| in the [Prometheus metrics](../../server/administration/monitoring/prometheus.mdx#metrics-provided-by-the-prometheus-endpoint),
| and in the [JSON monitoring endpoints](../../server/administration/monitoring/telegraf.mdx#monitoring-endpoints).
> JSON monitoring endpoints

Is that what we officially call this feature? I thought it was "Monitoring endpoints" (https://docs.ravendb.net/7.2/server/administration/monitoring/telegraf#monitoring-endpoints).
Issue link
RDoc-3843
RDoc-3844
RDoc-3845
RDoc-3849
RDoc-3775
RDoc-3854
RDoc-3861
RDoc-3851
RDoc-3811
Type of change
…/templates or readme)
Changes in docs URLs
…/scripts/redirects.json file, set Documents Moved PR label)
Changes in UX/UI