OpenMPI and Docker containers #378
Problem Description
I am trying to run a simple Open MPI test using mpi4py and this Docker image (aalati/mpi_ex_mit). The container includes a Python script that uses mpi4py to check communication between the nodes, and a shell script that passes the Python script to mpiexec.
When I submit the job to the pool with Batch Shipyard, I get the following error:
Error response from daemon: Cannot kill container: simjob-aalati-mpi_ex_mit: No such container: simjob-aalati-mpi_ex_mit
Error: No such container: simjob-aalati-mpi_ex_mit
Warning: Permanently added '[10.0.0.5]:23' (ECDSA) to the list of known hosts.
Warning: Permanently added '[10.0.0.6]:23' (ECDSA) to the list of known hosts.
**********************************************************
Open MPI does not support recursive calls of mpiexec
**********************************************************
I am not sure whether the problem comes from the pool configuration, the job configuration, or the way the container is built. The last seems less likely, since the container works as expected when I run it locally. I have attached the Dockerfile and the jobs configuration below in case they are useful.
I would appreciate any advice on the issue. Thank you very much for your help.
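One way to narrow down the "recursive calls of mpiexec" message above is to check, from inside the container, whether the task command is already running under an Open MPI launcher before the shell script calls mpiexec itself. The following is a minimal sketch, not part of the original image: the helper names `launched_by_open_mpi` and `describe_launch` are hypothetical, and it assumes only that Open MPI exports `OMPI_COMM_WORLD_SIZE` and `OMPI_COMM_WORLD_RANK` to the processes it starts.

```python
import os

# Environment variables that Open MPI exports to the processes it launches.
OMPI_LAUNCH_VARS = ("OMPI_COMM_WORLD_SIZE", "OMPI_COMM_WORLD_RANK")


def launched_by_open_mpi(env=None):
    """Return True if the environment looks like an Open MPI launch.

    Hypothetical helper for diagnosis only; it inspects environment
    variables and does not import or initialize MPI.
    """
    env = os.environ if env is None else env
    return any(name in env for name in OMPI_LAUNCH_VARS)


def describe_launch(env=None):
    """Summarize whether a nested mpiexec call would be recursive."""
    env = os.environ if env is None else env
    if launched_by_open_mpi(env):
        rank = env.get("OMPI_COMM_WORLD_RANK", "?")
        size = env.get("OMPI_COMM_WORLD_SIZE", "?")
        return (
            f"already under Open MPI (rank {rank} of {size}); "
            "a nested mpiexec would be recursive"
        )
    return "not under Open MPI; calling mpiexec here starts a fresh launch"


# Report the situation for the current process.
print(describe_launch())
```

If this reports that the process is already under Open MPI when run as the task command, the mpiexec call inside mpi_example.sh would be a second, nested launch, which would be consistent with the error shown above.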
Batch Shipyard Version
I am using the version available in Azure Cloud Shell for now.
Redacted Configuration
jobs.yaml
job_specifications:
- auto_complete: true
  id: simjob
  tasks:
  - docker_image: aalati/mpi_ex_mit
    additional_docker_run_options: [-w /root/codes]
    multi_instance:
      num_instances: pool_current_dedicated
      mpi:
        runtime: openmpi
        executable_path: /usr/bin/mpiexec
        processes_per_node: nproc
        options:
        - -mca btl_base_warn_component_unused 0
    command: /bin/bash -c "bash -i ./mpi_example.sh"
Dockerfile
# Filename: Dockerfile
FROM ubuntu:18.04
COPY ssh_config /root/.ssh/config
RUN apt-get -y update && \
    apt-get -y install gcc gfortran g++ libopenmpi-dev wget openssh-server openssh-client \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    # configure ssh server and keys
    && mkdir /var/run/sshd \
    && ssh-keygen -A \
    && sed -i 's/PermitRootLogin without-password/PermitRootLogin yes/' /etc/ssh/sshd_config \
    && sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd \
    && ssh-keygen -f /root/.ssh/id_rsa -t rsa -N '' \
    && chmod 600 /root/.ssh/config \
    && chmod 700 /root/.ssh \
    && cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /root/miniconda.sh && \
    bash ~/miniconda.sh -b && \
    export PATH="/root/miniconda3/bin:$PATH" && \
    conda init bash
RUN . /root/.bashrc && \
    conda create --name py37 python=3.7 -y && \
    conda activate py37 && \
    conda install numpy scipy matplotlib tabulate seaborn statsmodels -y && \
    #conda install -c conda-forge interpolation fredapi time bdw-gc r-operator.tools r-sys -y && \
    wget https://bitbucket.org/mpi4py/mpi4py/downloads/mpi4py-3.0.3.tar.gz -O /root/mpi4py-3.0.3.tar.gz && \
    tar -zxf /root/mpi4py-3.0.3.tar.gz && \
    cd mpi4py-3.0.3 && \
    which mpicc python && \
    python setup.py build && \
    python setup.py install && \
    conda install -c anaconda openpyxl && \
    conda install -c conda-forge interpolation fredapi time -y
COPY codes /root/codes
RUN chmod u+x /root/codes/mpi_example.sh
WORKDIR /root/codes
RUN echo "conda activate py37" >> /root/.bashrc
#ENTRYPOINT ["bash","-i","./mpi_example.sh"]
# make sshd listen on 23 and run by default
EXPOSE 23
CMD ["/usr/sbin/sshd", "-D", "-p", "23"]
ssh_config
Host 10.*
    Port 23
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null