Description
User Story
As a representative of the LHCb DIRAC community at Barcelona, we need a system that groups single-core jobs so that, when they reach our worker nodes, they can be executed in parallel using as many cores as possible.
Feature Description
Currently we are underutilizing our resources: we are only executing single-core jobs on worker nodes with 112 cores. This harms performance greatly, as we could theoretically reach a 112x improvement.
This is due to the fact that these worker nodes have no external connectivity, so we are unable to accept pilot jobs. On top of that, the main simulation program being executed, Gauss, only uses a single core for its simulations.
This problem has existed for a while, but we need a solution now more than ever, and with the improvements made to the PushJobAgent, it is the perfect moment to address it.
The idea would be to create some kind of intermediary CE that sits between the PushJobAgent and an AREX CE and receives single-core jobs.
Those single-core jobs would be grouped, or bundled, into a single multiprocessor job that gets sent to the AREX as just one job.
Finally, the worker node would split the bundle back into the individual single-core jobs and execute them all at once, utilizing all of the node's cores if possible.
The system must be easy to configure for administrators and transparent for users.
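The worker-node side of this idea can be sketched in a few lines. This is a hypothetical illustration, not the DIRAC API: the function names (`run_payload`, `run_bundle`) and the shell-command payload representation are assumptions; a real bundle would carry full job descriptions and Gauss invocations rather than plain commands.

```python
# Sketch: split a bundled multiprocessor job back into its single-core
# payloads and run them in parallel across the node's cores.
# All names here are illustrative, not part of DIRAC.
import os
import subprocess
from concurrent.futures import ProcessPoolExecutor


def run_payload(cmd):
    """Run one single-core payload command and return its exit code."""
    return subprocess.run(cmd, shell=True).returncode


def run_bundle(payload_cmds, cores=None):
    """Execute the bundled single-core payloads in parallel,
    using at most `cores` worker processes (default: all cores)."""
    cores = cores or os.cpu_count()
    with ProcessPoolExecutor(max_workers=cores) as pool:
        # map preserves input order, so exit codes line up with payloads
        return list(pool.map(run_payload, payload_cmds))


if __name__ == "__main__":
    # e.g. a bundle of 4 Gauss-like single-core payloads on one node
    cmds = ["echo payload-%d" % i for i in range(4)]
    print(run_bundle(cmds))
```

The intermediary CE would only need to produce the `payload_cmds` list and submit this wrapper to AREX as one multi-core job; per-payload exit codes would then be reported back so the PushJobAgent can set each original job's final status.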
Definition of Done
- Create a system capable of grouping multiple jobs into a single one that then gets sent to the worker node, split, and executed.
- Ensure that the system is robust against an AREX CE.
- It has to work when the jobs get submitted through a PushJobAgent.
Alternatives Considered
Modifying the Matcher so that it matches multiple jobs at a time, instead of one by one, could also work.
Related Issues
No response
Additional Context
No response