Generating metric cache #175

@TheNeeloy

Hello,

While running ./run_metric_caching.sh, I see some warnings and I am not sure whether I should be concerned about them.
Below is the beginning of the logs:

(navsim) [neeloyc2@gpub025 evaluation]$ ./run_metric_caching.sh                                                                                                                                         [21/21]
2025-12-06 15:48:23,420 INFO {/work/hdd/beje/neeloyc2/navsim_workspace/navsim/navsim/planning/script/builders/worker_pool_builder.py:19}  Building WorkerPool...                                               
2025-12-06 15:48:26,314 INFO {/work/hdd/beje/neeloyc2/navsim_workspace/navsim/navsim/planning/utils/multithreading/worker_ray_no_torch.py:51}  Not using GPU in ray                                            
2025-12-06 15:48:26,315 INFO {/work/hdd/beje/neeloyc2/navsim_workspace/navsim/navsim/planning/utils/multithreading/worker_ray_no_torch.py:77}  Starting ray local!                                             
2025-12-06 15:48:33,372 INFO worker.py:2012 -- Started a local Ray instance.                                                                                                                                   
/projects/beje/neeloyc2/miniconda3/envs/navsim/lib/python3.9/site-packages/ray/_private/worker.py:2051: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(                                                                                                                                                                                               
2025-12-06 15:48:54,334 INFO {/projects/beje/neeloyc2/miniconda3/envs/navsim/lib/python3.9/site-packages/nuplan/planning/utils/multithreading/worker_pool.py:101}  Worker: RayDistributedNoTorch               
2025-12-06 15:48:54,334 INFO {/projects/beje/neeloyc2/miniconda3/envs/navsim/lib/python3.9/site-packages/nuplan/planning/utils/multithreading/worker_pool.py:102}  Number of nodes: 1                          
Number of CPUs per node: 8                                                                                                                                                                                     
Number of GPUs per node: 0                                                                                                                                                                                     
Number of threads across all nodes: 8                                                                                                                                                                          
2025-12-06 15:48:54,335 INFO {/work/hdd/beje/neeloyc2/navsim_workspace/navsim/navsim/planning/script/builders/worker_pool_builder.py:27}  Building WorkerPool...DONE!                                          
2025-12-06 15:48:54,335 INFO {/work/hdd/beje/neeloyc2/navsim_workspace/navsim/navsim/planning/script/run_metric_caching.py:29}  Starting Metric Caching...                                                     
Loading logs:  38%|███████████████████████████████████████████████████████████████████▎                                                                                                            | 52/136 [00:01<00:02, 39.71it/s](pid=gcs_server) [2025-12-06 15:48:56,961 E 2646651 2646651] (gcs_server) gcs_server.cc:302: Failed to establish connection to the event+metrics exporter agent. Events and metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
Loading logs: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 136/136 [00:04<00:00, 28.55it/s]
2025-12-06 15:49:00,111 INFO {/work/hdd/beje/neeloyc2/navsim_workspace/navsim/navsim/planning/metric_caching/caching.py:166}  Starting metric caching of 136 files...                                          
Ray objects:   0%|                                                                                                                                                                                            | 0/8 [00:00<?, ?it/s](raylet) [2025-12-06 15:49:03,429 E 2646855 2646855] (raylet) main.cc:975: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
Loading logs:   0%|          | 0/17 [00:00<?, ?it/s]                                                   
Loading logs:   6%|▌         | 1/17 [00:00<00:01,  9.95it/s]                                                                                                                                                   
(wrapped_fn pid=2646946) [2025-12-06 15:49:24,276 E 2646946 2647257] core_worker_process.cc:825: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14                    
[2025-12-06 15:49:24,447 E 2645650 2646941] core_worker_process.cc:825: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14                                             
Loading logs: 100%|██████████| 17/17 [00:04<00:00,  3.90it/s]                                                                                                                                                  
(wrapped_fn pid=2646942) INFO:navsim.planning.metric_caching.caching:Extracted 1379 scenarios for thread_id=5ce39b20-37a9-4a94-af45-81ebe1c68e6f, node_id=0.
Loading logs:  94%|█████████▍| 16/17 [00:04<00:00,  6.59it/s]                                                                                                                                                  
(wrapped_fn pid=2646942) INFO:navsim.planning.metric_caching.caching:Processing scenario 1 / 1379 in thread_id=5ce39b20-37a9-4a94-af45-81ebe1c68e6f, node_id=0
Loading logs:   0%|          | 0/17 [00:00<?, ?it/s] [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)                
Loading logs:  82%|████████▏ | 14/17 [00:05<00:01,  2.74it/s] [repeated 98x across cluster]                                                                                                                    
(wrapped_fn pid=2646944) [2025-12-06 15:49:24,447 E 2646944 2647460] core_worker_process.cc:825: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14 [repeated 7x across cluster]
Loading logs: 100%|██████████| 17/17 [00:05<00:00,  3.11it/s] [repeated 6x across cluster]                                                                                                                     
(wrapped_fn pid=2646943) INFO:navsim.planning.metric_caching.caching:Extracted 1899 scenarios for thread_id=5dc7b354-a027-4042-a8b8-0c7ca7b5a410, node_id=0. [repeated 7x across cluster]
Loading logs: 100%|██████████| 17/17 [00:05<00:00,  3.17it/s] [repeated 3x across cluster]                                                                                                                     
(wrapped_fn pid=2646942) INFO:navsim.planning.metric_caching.caching:Processing scenario 2 / 1379 in thread_id=5ce39b20-37a9-4a94-af45-81ebe1c68e6f, node_id=0 [repeated 8x across cluster]
Loading logs:  88%|████████▊ | 15/17 [00:05<00:00,  3.33it/s] [repeated 2x across cluster]                                                                                                                     
(wrapped_fn pid=2646945) INFO:navsim.planning.metric_caching.caching:Processing scenario 2 / 1800 in thread_id=87e019a7-ef38-4687-98c2-1b257c06101c, node_id=0 [repeated 2x across cluster]
(wrapped_fn pid=2646945) INFO:navsim.planning.metric_caching.caching:Processing scenario 3 / 1800 in thread_id=87e019a7-ef38-4687-98c2-1b257c06101c, node_id=0 [repeated 5x across cluster]
(wrapped_fn pid=2646944) INFO:navsim.planning.metric_caching.caching:Processing scenario 2 / 1105 in thread_id=87d96831-24e4-45e9-8a51-38c993cc499b, node_id=0 [repeated 5x across cluster]
(wrapped_fn pid=2646949) INFO:navsim.planning.metric_caching.caching:Processing scenario 4 / 1398 in thread_id=d7f6cf92-93c1-4949-bb27-26ccd94ed98d, node_id=0 [repeated 6x across cluster]
(wrapped_fn pid=2646948) INFO:navsim.planning.metric_caching.caching:Processing scenario 3 / 1580 in thread_id=4dff173c-2128-4e16-a970-8422238b69c4, node_id=0 [repeated 4x across cluster]
(wrapped_fn pid=2646945) INFO:navsim.planning.metric_caching.caching:Processing scenario 7 / 1800 in thread_id=87e019a7-ef38-4687-98c2-1b257c06101c, node_id=0 [repeated 8x across cluster]
(wrapped_fn pid=2646943) INFO:navsim.planning.metric_caching.caching:Processing scenario 5 / 1899 in thread_id=5dc7b354-a027-4042-a8b8-0c7ca7b5a410, node_id=0 [repeated 4x across cluster]
(wrapped_fn pid=2646946) INFO:navsim.planning.metric_caching.caching:Processing scenario 7 / 1563 in thread_id=4c2a8688-0451-4732-9e95-39b5d7ad0354, node_id=0 [repeated 6x across cluster]
...
...
...

Is it normal to see this error: "Failed to establish connection to the event+metrics exporter agent. Events and metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14"?

I also wanted to report that I had to manually set number_of_cpus_per_node=8 in navsim_workspace/navsim/navsim/planning/utils/multithreading/worker_ray_no_torch.py instead of using the default cpu_count(logical=True), which returned 64 on my machine. With 64 CPUs, Ray would freeze when attempting to connect to the instance.
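
For reference, a minimal sketch of that workaround, assuming the only goal is to cap the CPU count handed to Ray. The helper name capped_cpu_count and the default cap of 8 are illustrative and not part of navsim; os.cpu_count() stands in here for the cpu_count(logical=True) call mentioned above.

```python
import os
from typing import Optional


def capped_cpu_count(cap: int = 8, detected: Optional[int] = None) -> int:
    """Return the CPU count to give to Ray, never more than `cap`.

    `detected` can be passed explicitly (e.g. for testing); otherwise the
    logical CPU count reported by the OS is used.
    """
    if cap < 1:
        raise ValueError("cap must be at least 1")
    if detected is None:
        # os.cpu_count() may return None on some platforms; fall back to 1.
        detected = os.cpu_count() or 1
    return max(1, min(detected, cap))


if __name__ == "__main__":
    # On a 64-CPU node this yields 8 instead of 64.
    print(capped_cpu_count(cap=8, detected=64))  # 8
```

The returned value could then be used wherever number_of_cpus_per_node is set, rather than hard-coding 8.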

Thanks for your time!
