Skip to content
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
328 changes: 328 additions & 0 deletions docs/IndexerRefactorStrategy.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,328 @@
# Ocean Node Indexer - Event Monitoring & Error Handling

## Table of Contents

1. [Overview](#overview)
2. [🔴 PROPOSED IMPROVEMENTS (Post-Meeting Changes)](#-proposed-improvements-post-meeting-changes)

---

## Overview

### Current Indexer Architecture

The Ocean Node Indexer is built with the following design principles (see [Architecture.md](./Arhitecture.md) for details):

- **Single-threaded, non-blocking design**: Uses Node.js async/await for concurrent execution across multiple chains
- **ChainIndexer instances**: Each blockchain network is monitored by a dedicated ChainIndexer instance running concurrently via the event loop
- **Event-driven communication**: Components communicate through EventEmitter for clean separation of concerns
- **Efficient I/O handling**: All RPC calls, database operations, and network requests are non-blocking, allowing high concurrency without worker threads

### Proposed Architecture Evolution

The refactoring strategy below maintains the core single-threaded, non-blocking architecture while introducing key improvements:

1. **EventEmitter → Persistent Queues**: Replace synchronous EventEmitter with persistent queue system for better reliability and observability
2. **Event-level retry**: Move retry logic from embedded operations to event-level processing
3. **Enhanced error tracking**: Introduce comprehensive logging via `ddo_logs` index
4. **Per-chain resilience**: Isolate chain failures to prevent cascading issues

These changes preserve the efficient I/O model and concurrent ChainIndexer execution while adding production-grade error handling and monitoring.

---

## 🔴 PROPOSED IMPROVEMENTS (Post-Meeting Changes)

> **Status:** Draft proposals from Jan 27, 2026 meeting
> **Goal:** Improve reliability, decoupling, and error handling

### 1. 🎯 EVENT-LEVEL RETRY MECHANISM WITH QUEUES

**Current Issue:** Retry logic is deeply embedded in event processing steps (e.g., inside DDO decryption)

**Proposed Change:**

- **Move retry logic to event level** (not deep inside processing steps)
- **Implement queue-based retry system** for all 12 event types
- **Decouple retry from specific operations** (e.g., decrypt, p2p, HTTP)

**Implementation:**

```
┌─────────────────────────────────────────────────────────────┐
│ EVENT ERROR PROCESSING QUEUE │
├─────────────────────────────────────────────────────────────┤
│ │
│ Event Detected │
│ ↓ │
│ Send to Processor │
│ ↓ │
│ Process Event │
│ ├─ Success → Mark complete, update DB │
│ └─ Failure → Add to Retry Queue with backoff │
│ │
│ Retry Queue (exponential backoff): │
│ - Retry 1: ~10 seconds │
│ - Retry 2: ~1 minute │
│ - Retry 3: ~10 minutes │
│ - Retry 4: ~1 hour │
│ - Retry 5: ~1 week (final attempt) │
│ │
│ Benefits: │
│ ✓ Non-blocking (doesn't halt chain indexing) │
│ ✓ Works for ALL error types (HTTP, P2P, RPC, DB) │
│ ✓ Visible retry state in monitoring │
└─────────────────────────────────────────────────────────────┘
```

**Applies to:** All event processors, especially METADATA_CREATED/UPDATED

---

### 2. 🗄️ NEW DATABASE INDEX: `ddo_logs`

**Current Issue:**

- `ddoState` only tracks metadata events
- Order and pricing events have no error tracking
- No unified view of all DDO-related events

**Proposed Change:**

- Create new DB index: **`ddo_logs`**
- Store **all events** related to a DID (metadata, orders, pricing)
- Similar structure to `ddoState` but broader scope
- **Add handler and routes (HTTP + P2P)** to query all information about a DID, transaction, or event
- Similar to existing `ddo-state` handler but for comprehensive logs
- Enable querying by: `did`, `txHash`, `blockNumber`, `eventType`
- Support both HTTP API endpoints and P2P protocol for distributed querying

**Schema:**

```typescript
interface DdoLog {
did: string // Indexed
chainId: number // Indexed
eventType: string // METADATA_CREATED, ORDER_STARTED, etc.
eventHash: string // Event signature hash
txHash: string // Transaction hash
blockNumber: number // Block number
timestamp: number // Event timestamp
status: 'success' | 'failed' | 'retrying'
error?: string // Error message if failed
retryCount: number // Number of retry attempts default 0
lastRetry?: number // Timestamp of last retry
metadata?: Record<string, any> // Event-specific data
}
```

**Benefits:**

- Single source of truth for all DDO events
- Easier debugging (see all events for a DID)
- Track pricing/order event errors (not just metadata)
- Audit trail for compliance

---

### 3. 🔄 REPLACE EventEmitter WITH QUEUES

**Current Issue:**

- Using `EventEmitter` for communication
- Synchronous, blocking behavior
- No retry/replay capability
- Difficult to test

**Proposed Change:**

- Replace `EventEmitter` with **persistent queue system**
- Use queue for:
- ✓ Newly indexed assets (instead of `eventEmitter.emit()`)
- ✓ Reindex requests (block & transaction level)
- ✓ Admin commands

**Queue Types:**

```
1. EVENT_PROCESSING_QUEUE (primary)
- New events from blockchain
- Priority: FIFO with retry backoff

2. REINDEX_QUEUE (existing, enhance)
- Block-level reindex
- Transaction-level reindex
- Priority: Admin requests > Auto-retry

3. ORDER_QUEUE (new)
- Store orders even if DDO not found
- Process when DDO becomes available
```

**Benefits:**

- Testable (can inject mock queue)
- Observable (queue depth, retry counts)
- Resilient (survives crashes)
- Decoupled (no tight coupling between components)

---

### 4. 📦 HANDLE MISSING DDO IN ORDER/PRICING EVENTS

**Current Issue:**

- If DDO not found → skip order/pricing event
- Lost data if DDO indexed later

**Proposed Change:**

**Unified Queue-Based Approach for Both Orders and Pricing Events:**

```
IF DDO not found:
1. Check if DDO exists in database
2. If not found → add event to pending queue
3. Store event in ddo_logs with status: 'pending_ddo'
4. Link event to DID for future reconciliation
5. When DDO is successfully indexed:
→ Process all pending events for that DID (orders + pricing)
→ Update event status from 'pending_ddo' to 'success' or 'failed'
→ Maintain event order based on blockNumber and logIndex
```

**Queue Structure:**

```typescript
interface PendingEvent {
did: string
eventType: string // ORDER_STARTED, ORDER_REUSED, DISPENSER_*, EXCHANGE_*
chainId: number
txHash: string
blockNumber: number
timestamp: number
retryCount: number
queuedAt: number
}
```

**Benefits:**

- **Consistent approach** for all event types (orders + pricing)
- **No data loss** - all events queued and processed eventually
- **Maintains event order** using blockNumber and logIndex
- **Automatic reconciliation** when DDO becomes available
- **Better observability** - track pending events per DID
- **Prevents orphaned records** - only create records when DDO exists

---

### 5. 🚫 MOVE RETRY LOGIC TO ChainIndexer (Block Only That Chain)

**Current Issue:**

- Crawler startup retry in `OceanIndexer`
- Failure blocks **entire node** (all chains)

**Proposed Change:**

- Move `retryCrawlerWithDelay()` → **ChainIndexer**
- Each chain fails independently
- Other chains continue indexing

**Benefits:**

- Resilient multi-chain indexing
- One bad RPC doesn't kill everything
- Easier debugging (per-chain logs)

---

### 6. 📍 BLOCK RETRY QUEUE IMPROVEMENTS

**Current Issue:**

- Failed block retried, but `lastIndexedBlock` not updated
- Same block retried indefinitely
- No expiry/max retry limit

**Proposed Change:**

```
When block added to retry queue:
1. Update lastIndexedBlock (move forward)
2. Add block to retry queue with metadata:
- blockNumber
- retryCount (starts at 0)
- maxRetries (default: 5)
- lastError
- expiryDate (when to give up)
3. Process retry queue separately (exponential backoff)
4. If maxRetries exceeded → log to failed_blocks table
```

**Retry Queue Schema:**

```typescript
interface BlockRetryTask {
chainId: number
blockNumber: number
retryCount: number
maxRetries: number
lastError: string
lastRetryAt: number
}
```

**Benefits:**

- Indexer moves forward (doesn't get stuck)
- Failed blocks retried in background
- Clear failure tracking

---

### 7. 🌐 REMOVE ECONNREFUSED-ONLY CONDITION

**Current Issue:**

- Retry only on `ECONNREFUSED` error
- Other errors (timeout, 500, p2p failures) not retried

**Proposed Change:**

- With event-level retry, **retry ALL error types**:
- ✓ RPC errors (timeout, 500, 429 rate limit)
- ✓ HTTP errors (decrypt service down)
- ✓ P2P errors (peer unreachable)
- ✓ Database errors (temp unavailable)
- ✓ Validation errors (maybe retryable)

---

### 8. ✅ UPDATE TESTS

**Required Test Updates:**

- Remove tests checking `EventEmitter` behavior
- Add tests for queue-based processing
- Add tests for retry with exponential backoff
- Add tests for orphaned orders
- Add tests for per-chain failure isolation
- Add tests for `ddo_logs` index
- Add tests for block retry with expiry

---

### Summary Table

| # | Change | Current Pain | Benefit | Effort | Priority |
| --- | --------------------------------------------- | --------------------------------- | ------------------------------------ | ------ | ----------- |
| 1 | Event-level retry + queues | Retry logic scattered, blocking | Unified, non-blocking, testable | High | 🔴 Critical |
| 2 | `ddo_logs` DB index | No order/pricing error tracking | Full audit trail, debugging | Medium | 🟡 High |
| 3 | Replace EventEmitter with queues | Blocking, not testable, no replay | Observable, resilient, testable | High | 🔴 Critical |
| 4 | Handle missing DDO (orphaned orders) | Lost orders/pricing data | No data loss, reconciliation | Medium | 🟡 High |
| 5 | Per-chain startup retry (ChainIndexer) | One failure kills entire node | Isolated failures, resilient | Low | 🔴 Critical |
| 6 | Block retry queue with expiry | Indexer stuck on bad blocks | Progress continues, background retry | Medium | 🟡 High |
| 7 | Retry ALL error types (not just ECONNREFUSED) | P2P/timeout/429 not retried | Comprehensive error handling | Low | 🟡 High |
| 8 | Update tests | Tests assume old architecture | Tests match new architecture | Medium | 🟢 Medium |
Loading