Missing structured task supervision for background workers #56

@qj0r9j0vc2

Summary

The node spawns multiple background workers (Primary, Worker, network handlers) without a centralized supervision strategy, making coordinated shutdown and failure handling difficult.

Problem Locations

  • crates/data-chain/src/primary/runner.rs:181 - Primary spawned
  • crates/data-chain/src/worker/core.rs:225 - Worker spawned
  • crates/node/src/main.rs:706 - Node handle spawned
  • crates/node/src/network.rs:106 - Network listener spawned

Current Pattern

// crates/data-chain/src/primary/runner.rs:180-183
let config_clone = config.clone();
let handle = tokio::spawn(async move {
    let mut primary = Primary::new_with_storage(...).await;
    // ...
});

Each component spawns tasks independently with no coordination.

Issues

  1. No Graceful Shutdown: Components cannot coordinate the order in which they shut down
  2. Dependency Blindness: The network layer may shut down before consensus finishes flushing
  3. Partial Failure Handling: A failed component does not signal the others to stop gracefully
  4. Resource Cleanup: Nothing guarantees that storage is flushed before the process exits

Recommended Fix

Implement a task supervision tree similar to Erlang/OTP:

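use std::fmt::Debug;
use std::future::Future;

use tokio_util::sync::CancellationToken;
use tokio_util::task::TaskTracker;
// Assumes the `tracing` macros; swap in `log` if that is what the crate uses.
use tracing::{error, info};

pub struct NodeSupervisor {
    tracker: TaskTracker,
    token: CancellationToken,
}

impl NodeSupervisor {
    pub fn new() -> Self {
        Self {
            tracker: TaskTracker::new(),
            token: CancellationToken::new(),
        }
    }

    /// Spawns a tracked task that ends when the shared token is cancelled.
    /// `name` is `&'static str` so it can move into the `'static` task.
    pub fn spawn<F, E>(&self, name: &'static str, future: F)
    where
        F: Future<Output = Result<(), E>> + Send + 'static,
        E: Debug + Send + 'static,
    {
        let token = self.token.clone();
        self.tracker.spawn(async move {
            tokio::select! {
                result = future => {
                    if let Err(e) = result {
                        error!("{} failed: {:?}", name, e);
                        // Propagate the failure so sibling tasks stop too.
                        token.cancel();
                    }
                }
                // Note: cancellation drops `future` at its current await point.
                _ = token.cancelled() => {
                    info!("{} shutting down", name);
                }
            }
        });
    }

    /// Cancels every task, then waits until all of them have finished.
    pub async fn shutdown(&self) {
        info!("Initiating graceful shutdown");
        self.token.cancel();
        self.tracker.close();
        self.tracker.wait().await;
        info!("All tasks terminated");
    }
}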

Shutdown Order Recommendation

  1. Stop accepting new network connections
  2. Drain in-flight consensus rounds
  3. Flush pending storage writes
  4. Close database connections
  5. Exit
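
The ordering above can be modelled with a dependency-free sketch. It uses plain `std` threads and channels as a stand-in for the node's async tasks, and the phase names and the `Phase`, `spawn_phase`, `shutdown_in_order`, and `run_demo` items are hypothetical. In the node itself the same sequence could presumably be built from one `CancellationToken` and `TaskTracker` per phase.

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// One shutdown phase: a stop signal plus the handle of the thread to wait for.
struct Phase {
    name: &'static str,
    stop: Sender<()>,
    handle: JoinHandle<()>,
}

/// Spawns a placeholder component that blocks until told to stop, then
/// records its name in `log` as a stand-in for real cleanup work.
fn spawn_phase(name: &'static str, log: Arc<Mutex<Vec<&'static str>>>) -> Phase {
    let (stop, stopped) = channel::<()>();
    let handle = thread::spawn(move || {
        // Block until the supervisor signals this phase (or drops the sender).
        let _ = stopped.recv();
        log.lock().unwrap().push(name);
    });
    Phase { name, stop, handle }
}

/// Stops phases strictly in order: signal one, wait for it, then move on.
fn shutdown_in_order(phases: Vec<Phase>) {
    for phase in phases {
        let _ = phase.stop.send(());
        phase.handle.join().expect("phase panicked during shutdown");
        println!("{} stopped", phase.name);
    }
}

/// Runs the four phases and returns the order in which they finished.
fn run_demo() -> Vec<&'static str> {
    let log = Arc::new(Mutex::new(Vec::new()));
    let phases: Vec<Phase> = ["network", "consensus", "storage", "database"]
        .iter()
        .map(|&name| spawn_phase(name, Arc::clone(&log)))
        .collect();
    shutdown_in_order(phases);
    // Bind the clone first so the lock guard is released before `log` drops.
    let order = log.lock().unwrap().clone();
    order
}

fn main() {
    println!("{:?}", run_demo());
}
```

Each phase is signalled only after the previous one has been joined, so a later phase such as storage cannot finish before the network and consensus phases have.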

Labels

enhancement (New feature or request)
