Open
Description
Hello,
While running ./run_metric_caching.sh, I see some warnings and errors, and I'm not sure whether I should be concerned about them.
Below is the beginning of the log:
(navsim) [neeloyc2@gpub025 evaluation]$ ./run_metric_caching.sh
2025-12-06 15:48:23,420 INFO {/work/hdd/beje/neeloyc2/navsim_workspace/navsim/navsim/planning/script/builders/worker_pool_builder.py:19} Building WorkerPool...
2025-12-06 15:48:26,314 INFO {/work/hdd/beje/neeloyc2/navsim_workspace/navsim/navsim/planning/utils/multithreading/worker_ray_no_torch.py:51} Not using GPU in ray
2025-12-06 15:48:26,315 INFO {/work/hdd/beje/neeloyc2/navsim_workspace/navsim/navsim/planning/utils/multithreading/worker_ray_no_torch.py:77} Starting ray local!
2025-12-06 15:48:33,372 INFO worker.py:2012 -- Started a local Ray instance.
/projects/beje/neeloyc2/miniconda3/envs/navsim/lib/python3.9/site-packages/ray/_private/worker.py:2051: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
warnings.warn(
2025-12-06 15:48:54,334 INFO {/projects/beje/neeloyc2/miniconda3/envs/navsim/lib/python3.9/site-packages/nuplan/planning/utils/multithreading/worker_pool.py:101} Worker: RayDistributedNoTorch
2025-12-06 15:48:54,334 INFO {/projects/beje/neeloyc2/miniconda3/envs/navsim/lib/python3.9/site-packages/nuplan/planning/utils/multithreading/worker_pool.py:102} Number of nodes: 1
Number of CPUs per node: 8
Number of GPUs per node: 0
Number of threads across all nodes: 8
2025-12-06 15:48:54,335 INFO {/work/hdd/beje/neeloyc2/navsim_workspace/navsim/navsim/planning/script/builders/worker_pool_builder.py:27} Building WorkerPool...DONE!
2025-12-06 15:48:54,335 INFO {/work/hdd/beje/neeloyc2/navsim_workspace/navsim/navsim/planning/script/run_metric_caching.py:29} Starting Metric Caching...
Loading logs: 38%|███████████████████████████████████████████████████████████████████▎ | 52/136 [00:01<00:02, 39.71it/s](pid=gcs_server) [2025-12-06 15:48:56,961 E 2646651 2646651] (gcs_server) gcs_server.cc:302: Failed to establish connection to the event+metrics exporter agent. Events and metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
Loading logs: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 136/136 [00
:04<00:00, 28.55it/s]
2025-12-06 15:49:00,111 INFO {/work/hdd/beje/neeloyc2/navsim_workspace/navsim/navsim/planning/metric_caching/caching.py:166} Starting metric caching of 136 files...
Ray objects: 0%| | 0/8 [00:00<?, ?it/s](raylet) [2025-12-06 15:49:03,429 E 2646855 2646855] (raylet) main.cc:975: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
Loading logs: 0%| | 0/17 [00:00<?, ?it/s]
Loading logs: 6%|▌ | 1/17 [00:00<00:01, 9.95it/s]
(wrapped_fn pid=2646946) [2025-12-06 15:49:24,276 E 2646946 2647257] core_worker_process.cc:825: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
[2025-12-06 15:49:24,447 E 2645650 2646941] core_worker_process.cc:825: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
Loading logs: 100%|██████████| 17/17 [00:04<00:00, 3.90it/s]
(wrapped_fn pid=2646942) INFO:navsim.planning.metric_caching.caching:Extracted 1379 scenarios for thread_id=5ce39b20-37a9-4a94-af45-81ebe1c68e6f, node_id=0.
Loading logs: 94%|█████████▍| 16/17 [00:04<00:00, 6.59it/s]
(wrapped_fn pid=2646942) INFO:navsim.planning.metric_caching.caching:Processing scenario 1 / 1379 in thread_id=5ce39b20-37a9-4a94-af45-81ebe1c68e6f, node_id=0
Loading logs: 0%| | 0/17 [00:00<?, ?it/s] [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
Loading logs: 82%|████████▏ | 14/17 [00:05<00:01, 2.74it/s] [repeated 98x across cluster]
(wrapped_fn pid=2646944) [2025-12-06 15:49:24,447 E 2646944 2647460] core_worker_process.cc:825: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14 [repeated 7x across cluster]
Loading logs: 100%|██████████| 17/17 [00:05<00:00, 3.11it/s] [repeated 6x across cluster]
(wrapped_fn pid=2646943) INFO:navsim.planning.metric_caching.caching:Extracted 1899 scenarios for thread_id=5dc7b354-a027-4042-a8b8-0c7ca7b5a410, node_id=0. [repeated 7x across cluster]
Loading logs: 100%|██████████| 17/17 [00:05<00:00, 3.17it/s] [repeated 3x across cluster]
(wrapped_fn pid=2646942) INFO:navsim.planning.metric_caching.caching:Processing scenario 2 / 1379 in thread_id=5ce39b20-37a9-4a94-af45-81ebe1c68e6f, node_id=0 [repeated 8x across cluster]
Loading logs: 88%|████████▊ | 15/17 [00:05<00:00, 3.33it/s] [repeated 2x across cluster]
(wrapped_fn pid=2646945) INFO:navsim.planning.metric_caching.caching:Processing scenario 2 / 1800 in thread_id=87e019a7-ef38-4687-98c2-1b257c06101c, node_id=0 [repeated 2x across cluster]
(wrapped_fn pid=2646945) INFO:navsim.planning.metric_caching.caching:Processing scenario 3 / 1800 in thread_id=87e019a7-ef38-4687-98c2-1b257c06101c, node_id=0 [repeated 5x across cluster]
(wrapped_fn pid=2646944) INFO:navsim.planning.metric_caching.caching:Processing scenario 2 / 1105 in thread_id=87d96831-24e4-45e9-8a51-38c993cc499b, node_id=0 [repeated 5x across cluster]
(wrapped_fn pid=2646949) INFO:navsim.planning.metric_caching.caching:Processing scenario 4 / 1398 in thread_id=d7f6cf92-93c1-4949-bb27-26ccd94ed98d, node_id=0 [repeated 6x across cluster]
(wrapped_fn pid=2646948) INFO:navsim.planning.metric_caching.caching:Processing scenario 3 / 1580 in thread_id=4dff173c-2128-4e16-a970-8422238b69c4, node_id=0 [repeated 4x across cluster]
(wrapped_fn pid=2646945) INFO:navsim.planning.metric_caching.caching:Processing scenario 7 / 1800 in thread_id=87e019a7-ef38-4687-98c2-1b257c06101c, node_id=0 [repeated 8x across cluster]
(wrapped_fn pid=2646943) INFO:navsim.planning.metric_caching.caching:Processing scenario 5 / 1899 in thread_id=5dc7b354-a027-4042-a8b8-0c7ca7b5a410, node_id=0 [repeated 4x across cluster]
(wrapped_fn pid=2646946) INFO:navsim.planning.metric_caching.caching:Processing scenario 7 / 1563 in thread_id=4c2a8688-0451-4732-9e95-39b5d7ad0354, node_id=0 [repeated 6x across cluster]
...
...
...
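The Ray messages in the log above name two environment variables, RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO and RAY_DEDUP_LOGS. A minimal sketch of setting them from Python follows. The helper name apply_ray_env is hypothetical and not part of navsim, nuplan, or Ray, and it is assumed that the variables must be set before Ray is imported or started. This only affects the FutureWarning and the log deduplication; it is not claimed to change the metrics exporter agent error.

```python
import os

# Variable names and values are copied from the Ray messages in the log.
RAY_ENV_OVERRIDES = {
    "RAY_DEDUP_LOGS": "0",                      # show every worker's log line
    "RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO": "0",  # opt in to the future behavior
}


def apply_ray_env(overrides=RAY_ENV_OVERRIDES, environ=os.environ):
    """Set each variable unless it is already defined; return the final values.

    Hypothetical helper for illustration only. Call it before Ray starts
    (assumption: Ray reads these variables at startup).
    """
    for name, value in overrides.items():
        # setdefault keeps a value the user already exported in the shell.
        environ.setdefault(name, value)
    return {name: environ[name] for name in overrides}


if __name__ == "__main__":
    print(apply_ray_env())
```

Using setdefault rather than plain assignment means a value exported in the shell before launching ./run_metric_caching.sh still takes precedence.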
Is it normal to see the following error? "Failed to establish connection to the event+metrics exporter agent. Events and metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14"
I also wanted to report that I had to manually set number_of_cpus_per_node=8 in navsim_workspace/navsim/navsim/planning/utils/multithreading/worker_ray_no_torch.py instead of using the default cpu_count(logical=True), which returned 64 on this node. With the default of 64, Ray would freeze when attempting to connect to the instance.
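As an alternative to hard-coding 8, the worker count could be derived at runtime. The sketch below is only an illustration: resolve_cpus_per_node is a hypothetical helper, not part of navsim or nuplan, and the default cap of 8 is simply the value that worked here. It assumes a Linux node where the job may be allowed fewer CPUs than the machine reports; whether that is what caused the freeze is not confirmed.

```python
import os


def resolve_cpus_per_node(cap: int = 8) -> int:
    """Return a worker count bounded by `cap` and by the CPUs usable here.

    Hypothetical helper for illustration; the cap of 8 is the value that
    worked in this report, not a recommended default.
    """
    try:
        # Linux only: the set of CPUs this process is allowed to run on,
        # which can be smaller than the total number of logical CPUs.
        usable = len(os.sched_getaffinity(0))
    except AttributeError:
        # Platforms without sched_getaffinity: fall back to the total count.
        usable = os.cpu_count() or 1
    # Never return less than one worker.
    return max(1, min(cap, usable))


if __name__ == "__main__":
    print(resolve_cpus_per_node())
```

The returned value could then be passed as number_of_cpus_per_node in place of the literal 8.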
Thanks for your time!