Launching SmartRedis on the cluster #32

@Fantasy98

Hi!
I ran into an issue when launching the framework on a cluster.

Encountered issue:

When using a single GPU node on the JUWELS cluster, I encountered an issue with the Slurm launcher:

  # Setting up "slurm" runtime...
   !! Failed: Less than 2 hosts found in environment! !!
  # Trying to setup LOCAL runtime instead...
  # Success!
  # Starting the Orchestrator...
  # Success!
  
  # Use this command to shutdown database if not terminated correctly:
  # $(smart dbcli) -h 127.0.0.1 -p 6557 shutdown
  
  # Configuration of runtime environment:
  #   Scheduler: local
  #   Hosts:     ['jwb0129']

Because the launcher falls back to local, the run then fails with a runtime error:

  File "/p/scratch/deepwing/yuningw/04-Reproduce-SOD2D/examples/juwels_gpu_cyl_rl/ywsmf/runtime.py", line 230, in launch_models
    raise ValueError('srun launcher only supported for SLURM scheduler!')
ValueError: srun launcher only supported for SLURM scheduler!
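
A minimal sketch of the behaviour that the log and traceback above suggest: a "slurm" request with fewer than 2 hosts falls back to "local", and the later `srun` check then fails. The function names `choose_scheduler` and `check_run_command` are hypothetical and are not taken from `runtime.py`; only the error message and the host name come from the output above.

```python
from __future__ import annotations

# Hypothetical sketch: these names are illustrative, not the framework's real API.


def choose_scheduler(requested: str, hosts: list[str]) -> str:
    """Fall back to 'local' when a 'slurm' request sees fewer than 2 hosts."""
    if requested == "slurm" and len(hosts) < 2:
        return "local"
    return requested


def check_run_command(run_command: str, scheduler: str) -> None:
    """Reject 'srun' unless the effective scheduler is 'slurm'."""
    if run_command == "srun" and scheduler != "slurm":
        raise ValueError("srun launcher only supported for SLURM scheduler!")


# Single-node allocation, as in the log above (Hosts: ['jwb0129']).
scheduler = choose_scheduler("slurm", ["jwb0129"])
print(scheduler)  # local

try:
    check_run_command("srun", scheduler)
except ValueError as err:
    print(err)  # srun launcher only supported for SLURM scheduler!
```

Under these assumptions, `launcher: "slurm"` combined with `run_command: "srun"` can only succeed when at least 2 hosts are visible to the job.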

Background:

In config.yml:

      smartsim:
        n_dbs: 1
        network_interface: "ib0"
        run_command: "srun"
        launcher: "slurm"

For sbatch script:

      #SBATCH -t 01:00:00
      #SBATCH -p develbooster
      #SBATCH --exclusive
      #SBATCH -N 1
      #SBATCH --ntasks-per-node=1
      #SBATCH --cpus-per-task=48
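
A small diagnostic sketch for checking, from inside the job, how many nodes the allocation provides. It assumes the standard Slurm environment variable `SLURM_JOB_NUM_NODES` is set in the batch job; the helper names and the threshold of 2 (taken from the log message above) are illustrative, and how the framework itself counts hosts is not known from this issue.

```python
import os
from typing import Mapping


def slurm_node_count(env: Mapping[str, str] = os.environ) -> int:
    """Return the node count of the current Slurm allocation, or 0 outside one."""
    return int(env.get("SLURM_JOB_NUM_NODES", "0"))


def meets_host_requirement(env: Mapping[str, str] = os.environ, minimum: int = 2) -> bool:
    """Check whether the allocation has at least `minimum` nodes (assumed threshold)."""
    return slurm_node_count(env) >= minimum


# With `#SBATCH -N 1`, Slurm reports a single node for the job.
print(meets_host_requirement({"SLURM_JOB_NUM_NODES": "1"}))  # False
print(meets_host_requirement({"SLURM_JOB_NUM_NODES": "2"}))  # True
```

Calling `meets_host_requirement()` with no arguments inside the sbatch script reads the real job environment.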

My questions:

  1. How can I solve this issue?
  2. Why are at least 2 hosts required?
  3. Are there any tricks for the SBATCH settings?
  4. Is there any room for improvement?

Please let me know your thoughts on this, thanks a lot!
