MPI/SRUN usage on a single cluster node #40

@Fantasy98

Background: DRL using 4 CFD environments

  • Using a compute node with 4 GPUs and 64 CPUs

  • Using the smartsim configuration:

    smartsim:
      n_dbs: 1
      network_interface: "lo"
      run_command: "mpirun"
      launcher: "local"
    

Encountered issue:

  • All of the CFD instances run on GPU device:0, so the node's compute resources are used inefficiently.

  • The rank_file (i.e., .env000.txt) for launching mpirun is:

    rank 0=alvis4-05 slot=1        
    

The rank_file specifies only a host and a slot; it contains no GPU device binding.

Suggestions:

  • Modify how SmartFlow handles the local / slurm configuration to support this use case. We could take either of these paths:
  1. Make srun usable on a single cluster node.
  2. Incorporate the GPU-related arguments into the rank_file.
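
For illustration, here is a rough sketch of how a per-environment GPU assignment could be attached to the mpirun launch. It assumes Open MPI's mpirun (with its `-x` option for exporting environment variables and `--rankfile` for the rank file) and a CUDA-based solver that respects `CUDA_VISIBLE_DEVICES`. The helper names `gpu_for_env` and `build_mpirun_cmd`, the executable `./cfd_solver`, and the `.env{i:03d}.txt` naming pattern are hypothetical and are not part of SmartFlow or SmartSim.

```python
import shlex


def gpu_for_env(env_idx: int, n_gpus: int) -> int:
    """Round-robin mapping from environment index to GPU index."""
    if n_gpus <= 0:
        raise ValueError("n_gpus must be positive")
    return env_idx % n_gpus


def build_mpirun_cmd(env_idx, n_gpus, exe, n_ranks=1, rankfile=None):
    """Build an mpirun argument list that pins one environment to one GPU.

    Assumption: Open MPI's mpirun, where `-x NAME=value` exports an
    environment variable to the launched ranks.
    """
    gpu = gpu_for_env(env_idx, n_gpus)
    cmd = ["mpirun", "-n", str(n_ranks)]
    if rankfile is not None:
        cmd += ["--rankfile", rankfile]
    # With only one GPU visible, the solver's device:0 is physical GPU `gpu`.
    cmd += ["-x", f"CUDA_VISIBLE_DEVICES={gpu}", exe]
    return cmd


if __name__ == "__main__":
    # 4 environments on a node with 4 GPUs (hypothetical solver and file names).
    for i in range(4):
        cmd = build_mpirun_cmd(i, 4, "./cfd_solver", rankfile=f".env{i:03d}.txt")
        print(shlex.join(cmd))
```

Because each process then sees a single GPU, a solver that always selects device:0 would not need to change; only the launch command differs per environment.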

I am currently working on option 2 and will post updates in this issue. @soaringxmc
