Skip to content

Conversation

@denizural
Copy link
Contributor

As we have talked today with @mandresm, step 1 of the more robust Slurm interface is done.
Basically, these additions use more persistent sacct command instead of the squeue commands.
I also replaced some of the subprocess calls for easier regex parsing.
Of course, it is important to consider the recent chunk works from @dbarbi before taking everything for granted.

I tested these functions in a unit test suite. Here are the results:

# first submit a toy MPI job
$ sbatch job.run
Submitted batch job 29324848

$ ./test_esm_tools_slurm.py 29324848
status of the job: PENDING
are we still running the job: False

# insert your favourite SpongeBob later meme 😄 
$ ./test_esm_tools_slurm.py 29324848
status of the job: RUNNING
are we still running the job: True

$ ./test_esm_tools_slurm.py 29324848
status of the job: COMPLETED
already completed         # this line can be moved into a verbose block
are we still running the job: False

Some takeaways for @dbarbi,
I am not happy with the esm_runscripts submitting the tidy job even if the run fails for some reason. Eg. first month fails and esm_runscripts keeps re-submitting the later months. There is neat way of overcomming this. That was the first step in doing that.
In the second stage, I will implement a check in the tidy method that will look for the output of the previous Slurm job and then it will decide to submit the next job or not.

@denizural denizural requested a review from mandresm April 20, 2021 13:19
Base automatically changed from feature/pbs to develop April 23, 2021 12:38
out_search = re.search(out_pattern, str(squeue_output))
if out_search:
return out_search.group(1)
# state_command = f'squeue -j {str(jobid)} -o "%T"'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we erase old code instead of commenting it? If it's ever needed again, we can always go back in the git history and figure out what we had before

# 29319673|COMPLETED|
# 29319673.batch|COMPLETED|
# 29319673.0|COMPLETED|
pattern = f'{jobid}\|([A-Z]+)\|'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jobid is the slurm job number, or the slurm job name? What if someone has the same name for multiple jobs? I don't think slurm technically forbids having the same name twice (even if it is confusing)

Looking at the function header, it seems that is the slurm job number, which corresponds to:

SLURM_JOB_ID (and SLURM_JOBID for backwards compatibility)
The ID of the job allocation.

Maybe we can add a link in the docstring and refer to the section "OUTPUT ENVIRONMENT VARIABLES" here: https://slurm.schedmd.com/sbatch.html

# deniz: sacct is much better and persistent compared to squeue. Also
# getoutput returns standard strings compared to byte strings. This
# allows easier regex
command = f'sacct -j {str(jobid)} --parsable --format=jobid,State'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks much cleaner than what I hacked together this morning for one of my utilities:

  squeue -u $(whoami) -o "%Z %30j %T %M" | grep ${PROJECT_BASE} | cut -f 2- -d' ' | sort

I wish I had learned sacct earlier... 😭


@staticmethod
def job_is_still_running(jobid):
"""Returns a boolean if the job is still running"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't erase the docstring!!

def job_is_still_running(jobid):
"""Returns a boolean if the job is still running"""
return psutil.pid_exists(jobid)
# """Returns a boolean if the job is still running"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, maybe we can directly remove old code rather than just having it commented, that also might make looking at diffs easier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants