replaced squeue calls with sacct for more robust interface #100
base: develop
Conversation
| out_search = re.search(out_pattern, str(squeue_output)) | ||
| if out_search: | ||
| return out_search.group(1) | ||
| # state_command = f'squeue -j {str(jobid)} -o "%T"' |
Can we erase old code instead of commenting it out? If it's ever needed again, we can always go back through the git history and find what we had before.
| # 29319673|COMPLETED| | ||
| # 29319673.batch|COMPLETED| | ||
| # 29319673.0|COMPLETED| | ||
| pattern = f'{jobid}\|([A-Z]+)\|' |
Is jobid the Slurm job number or the Slurm job name? What if someone uses the same name for multiple jobs? I don't think Slurm technically forbids using the same name twice (even if it is confusing).
Looking at the function header, it seems to be the Slurm job number, which corresponds to:
SLURM_JOB_ID (and SLURM_JOBID for backwards compatibility)
The ID of the job allocation.
Maybe we can add a link in the docstring and refer to the section "OUTPUT ENVIRONMENT VARIABLES" here: https://slurm.schedmd.com/sbatch.html
| # deniz: sacct is much better and persistent compared to squeue. Also | ||
| # getoutput returns standard strings compared to byte strings. This | ||
| # allows easier regex | ||
| command = f'sacct -j {str(jobid)} --parsable --format=jobid,State' |
This looks much cleaner than what I hacked together this morning for one of my utilities:
squeue -u $(whoami) -o "%Z %30j %T %M" | grep ${PROJECT_BASE} | cut -f 2- -d' ' | sort
I wish I had learned sacct earlier... 😭
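For reference, the sacct-plus-regex approach discussed in this thread can be sketched roughly as below. This is a minimal illustration, not the code in this PR: the function names `parse_job_state` and `get_job_state` are hypothetical, and the pattern is deliberately a bit looser than the one in the diff (it also accepts underscores, as in `NODE_FAIL`, and does not require the closing pipe, so a state such as `CANCELLED by <uid>` still yields `CANCELLED`).

```python
import re
import subprocess


def parse_job_state(sacct_output, jobid):
    """Return the state of the main job record, or None if it is absent.

    Hypothetical helper: expects text in the `sacct --parsable` layout,
    e.g. "29319673|COMPLETED|".
    """
    # Anchor at the start of a line so that step records such as
    # "<jobid>.batch" or "<jobid>.0" are skipped.
    pattern = rf"^{re.escape(str(jobid))}\|([A-Z_]+)"
    match = re.search(pattern, sacct_output, flags=re.MULTILINE)
    return match.group(1) if match else None


def get_job_state(jobid):
    """Query sacct for a job and return its state (hypothetical wrapper)."""
    # Same command as in the diff; getoutput returns a str, not bytes.
    command = f"sacct -j {jobid} --parsable --format=jobid,State"
    return parse_job_state(subprocess.getoutput(command), jobid)
```

Keeping the parsing separate from the `sacct` call means the regex can be unit-tested with canned output on a machine without Slurm.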
| @staticmethod | ||
| def job_is_still_running(jobid): | ||
| """Returns a boolean if the job is still running""" |
don't erase the docstring!!
| def job_is_still_running(jobid): | ||
| """Returns a boolean if the job is still running""" | ||
| return psutil.pid_exists(jobid) | ||
| # """Returns a boolean if the job is still running""" |
Same as above: maybe we can remove old code directly rather than leaving it commented out. That might also make the diffs easier to read.
As discussed today with @mandresm, step 1 of the more robust Slurm interface is done.
Basically, these additions use the more persistent `sacct` command instead of the `squeue` commands. I also replaced some of the `subprocess` calls for easier regex parsing.
Of course, it is important to consider the recent chunk work from @dbarbi before taking everything for granted.
I tested these functions in a unit test suite. Here are the results:
Some takeaways for @dbarbi:
I am not happy with `esm_runscripts` submitting the `tidy` job even if the run fails for some reason. E.g., the first month fails and `esm_runscripts` keeps re-submitting the later months. There is a neat way of overcoming this, and this PR is the first step toward it.
In the second stage, I will implement a check in the `tidy` method that looks for the output of the previous Slurm job and then decides whether to submit the next job.
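To make the planned second stage a bit more concrete, here is a rough sketch of the decision logic, assuming the previous job's state has already been read from `sacct` as a plain string. The names `ACTIVE_STATES`, `job_is_still_running`, and `should_submit_next_job` are hypothetical and not part of `esm_runscripts`; the grouping of states is my assumption about what should count as "still running".

```python
# Assumed grouping: Slurm states that mean the job has not finished yet.
ACTIVE_STATES = {"PENDING", "RUNNING", "COMPLETING", "SUSPENDED"}


def job_is_still_running(state):
    """Return True if the given Slurm state means the job is not done yet."""
    return state in ACTIVE_STATES


def should_submit_next_job(previous_state):
    """Return True only if the previous job finished cleanly.

    Anything else (FAILED, TIMEOUT, CANCELLED, an unknown job, ...) stops
    the chain, so later months are not re-submitted after a failed run.
    """
    return previous_state == "COMPLETED"
```

With something like this, `tidy` would only queue the next month when `should_submit_next_job` returns True for the job that just ended.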