feat: BundleCE for job grouping #8476
Draft
AcquaDiGiorgio wants to merge 51 commits into DIRACGrid:integration from
Untested code. Plus the BundleCE should change to contact the service
Improved code legibility
Still not finished
More cases must be tested, such as killing and rescheduling bundled jobs. Also added Alexandre's AREXEnhanced CE
Outputs are obtained once and the rest grab them locally
Add debugging monitoring info (temporary)
This approach is mainly for debugging purposes
…jobs to fill the bundle. Also added some functionality for later use with an agent. This has not been tested
Adapted Service and CE to the schema of the DB
See #8475
Summary
This "system" consists of three main components: the CE, the Service and the DB. It also has an Agent, but it is not a critical piece.
It has been developed with `AREX` as the real CE, but in principle it should work with any other CE that implements `getJobOutput`.

The main idea is to receive the jobs through the `BundleCE`, which contacts the `BundleService` to store them in the `BundleDB`. When a certain number of processors is reached, the `BundleService` sends that bundle to the real CE.

Job status retrieval is done through the service, which obtains the `BundleID` of the requested `JobID` and contacts the real CE for its status.

The job output is obtained directly from the `BundleCE`. Each job obtains its own output without going through the `BundleService`.

The system
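As a rough orientation, the status lookup outlined in the summary (`JobID` to `BundleID`, then a query to the real CE) could look like the sketch below. It is only an illustration: `FakeRealCE`, `BundleServiceSketch`, their methods and the status strings are hypothetical names and not part of the DIRAC API or of this PR.

```python
class FakeRealCE:
    """Hypothetical stand-in for the real CE (e.g. AREX); not the DIRAC API."""

    def __init__(self):
        self._status = {}  # TaskID -> status string

    def set_status(self, task_id, status):
        self._status[task_id] = status

    def get_status(self, task_id):
        return self._status.get(task_id, "Unknown")


class BundleServiceSketch:
    """Hypothetical in-memory model of the service-side status lookup."""

    def __init__(self, real_ce):
        self.real_ce = real_ce
        self.job_to_bundle = {}   # JobID -> BundleID
        self.bundle_to_task = {}  # BundleID -> TaskID on the real CE

    def register_job(self, job_id, bundle_id):
        self.job_to_bundle[job_id] = bundle_id

    def mark_submitted(self, bundle_id, task_id):
        self.bundle_to_task[bundle_id] = task_id

    def get_job_status(self, job_id):
        # Step 1: resolve the bundle that holds this job.
        bundle_id = self.job_to_bundle.get(job_id)
        if bundle_id is None:
            return "Unknown"
        # Step 2: a bundle that has not been sent yet has no task on the real CE.
        # Reporting "Waiting" here is an assumption made for the sketch.
        task_id = self.bundle_to_task.get(bundle_id)
        if task_id is None:
            return "Waiting"
        # Step 3: ask the real CE about the bundle as a whole.
        return self.real_ce.get_status(task_id)
```

Every job in a bundle shares the same answer from the real CE, which is why the lookup only needs the bundle's task and not a per-job handle.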
Bundle CE
The `BundleCE` is the main piece of the puzzle. This Computing Element is in charge of contacting the service to store the jobs in bundles. It is a virtual CE that serves as a "proxy" between the agent uploading the jobs and the real CE, passing through the Bundle Service.

It works the same way as any other CE, and mimics the idea of the `PoolCE`, acting as an intermediary.

In theory, the Computing Element alone should be enough to operate this system, but due to bundle persistence issues, we need the rest of the parts. Having only the CE would also complicate things, as it would require a single instance of the `BundleCE` class holding the information of every bundle in memory.

Bundle DB
The `BundleDB` is in charge of storing the individual jobs in multiple bundles following certain rules. This lets us have a stateless system that is robust against restarts and sudden shutdowns.

This database saves plenty of information, such as the location of the jobs and their outputs, the proxy's location and the CE information.

To select which bundle each job will be stored in, it takes the real CE the job wants to be submitted to and checks it against the CE information of every bundle stored in the DB. If there is no bundle available, it creates a new one with this job in it.
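The selection rule above can be sketched as follows. This is a minimal illustration, not the code in this PR: the `Bundle` dataclass and `assign_job` are hypothetical, the field names loosely follow the `BundlesInfo` columns, and treating "available" as "same Site/CE/Queue and enough free processors" is an assumption.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Bundle:
    """Hypothetical in-memory view of one BundlesInfo row."""
    bundle_id: str
    site: str
    ce: str
    queue: str
    max_processors: int
    processor_sum: int = 0
    job_ids: list = field(default_factory=list)

    def fits(self, site, ce, queue, processors):
        # Assumption: a bundle is "available" when it targets the same real CE
        # and still has room for the job's processors.
        same_target = (self.site, self.ce, self.queue) == (site, ce, queue)
        return same_target and self.processor_sum + processors <= self.max_processors


def assign_job(bundles, job_id, site, ce, queue, processors, max_processors):
    """Store a job in a matching bundle, creating a new bundle if none fits.

    Returns the bundle and whether it reached its processor limit,
    i.e. whether it is ready to be sent to the real CE.
    """
    for bundle in bundles:
        if bundle.fits(site, ce, queue, processors):
            break
    else:
        # No bundle available: create a new one with this job in it.
        bundle = Bundle(uuid.uuid4().hex, site, ce, queue, max_processors)
        bundles.append(bundle)

    bundle.job_ids.append(job_id)
    bundle.processor_sum += processors
    ready = bundle.processor_sum >= bundle.max_processors
    return bundle, ready
```

In the real system this lookup would be a query against the DB rather than a loop over a list; the list only keeps the sketch self-contained.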
The ID of the bundle serves as the `PilotStamp`, as multiple jobs reach the same bundle.

```mermaid
---
title: BundleDB
---
erDiagram
    direction LR
    BundlesInfo {
        string BundleID PK
        int ProcessorSum
        int MaxProcessor
        string Site
        string CE
        string Queue
        text CEDict
        string TaskID
        enum Status
        set Flags
        datetime FirstTimestamp
        datetime LastTimestamp
    }
    JobToBundle {
        string JobID PK
        string BundleID FK
        int DiracID
        string ExecutablePath
        string Outputs
        int Processors
    }
    JobInputs {
        string InputID PK
        string JobID FK
        string InputPath
    }
    BundlesInfo ||--o{ JobToBundle : BundleID
    JobToBundle ||--o{ JobInputs : JobID
```

Bundle Service (Bundler)
The Service serves as the bridge between all of the components. The main tasks it manages are:
`BundleCE`

Bundle Agent (BundleManager)
The agent serves as a supplement to the system. In principle, it is not mandatory, but it helps in two specific cases (at the time of writing).
First, stalled bundles. The bundle might be able to store up to X jobs before submission, but sometimes this number might take too much time to reach due to a low influx of jobs. This agent checks the last time a job was submitted to each bundle and forces a submission if it is taking too long.
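The stalled-bundle check can be sketched as a filter over the bundles' last submission time. This is only an illustration under stated assumptions: `find_stalled_bundles` is a hypothetical helper, the rows are plain dictionaries keyed like the `BundlesInfo` columns, and the `"Storing"` status value is invented for the example.

```python
from datetime import datetime, timedelta


def find_stalled_bundles(bundles, now, max_wait):
    """Return the IDs of bundles that are still filling up but have not
    received a job for longer than max_wait.

    bundles:  iterable of dicts with "BundleID", "Status" and "LastTimestamp"
    now:      current time as a datetime
    max_wait: timedelta after which a submission should be forced
    """
    return [
        b["BundleID"]
        for b in bundles
        # "Storing" is a hypothetical status for a bundle that was not sent yet.
        if b["Status"] == "Storing" and now - b["LastTimestamp"] > max_wait
    ]
```

An agent cycle would then force the submission of each returned bundle, for instance with `find_stalled_bundles(rows, datetime.now(), timedelta(minutes=10))`, where the ten-minute limit is an arbitrary example value.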
Second, checking the bundle heartbeat. Once the bundle is sent, the best way of checking whether it is still alive is to query its status and report it to the `JobDB`. This could be done through the CE or the service, but as the agent only runs once every x seconds, checking it through the agent is much less CPU intensive than through the other options.

Known limitations
Not a priority, as it should be the proxy of the pilot every time (as far as I'm aware).
If the `PushJobAgent` is on machine "A" and it sends a job, the proxy is stored in the `/tmp` directory of machine "A"; then, for this system to work, we need to set up the `Bundle Service` on machine "A", as the service is the one submitting the bundle.

As storing the proxy in the DB is not on the table, storing the DN and group of the proxy and then matching it through the `ProxyManagerClient` might be the best way to go.

Another possibility could be to use `getRemoteCredentials` from the service. I need to look into this.

The outputs are stored at `/tmp/bundles` (modifiable by the administrator) first and then moved to the working directory of each of the jobs. This process is painfully slow and could collapse a machine if it has a tiny partition for `/tmp`, which is quite common.

By changing the behaviour of `getJobOutput`, we might be able to let each job retrieve its output directly.

When the bundle finishes, only one of the jobs downloads the outputs; the rest wait until it finishes. The file movement is done separately by each job.

If we can change the behaviour of `getJobOutput`, we can solve this one too.

If not, this could be a very difficult limitation to overcome.
`JobAgent` might not be able to see finalised bundles, as it checks the dictionary `self.taskResults` of the CE instead of calling `getJobOutput`.

This just requires some testing, as the current idea is viable. However, I think this solution should only exist alongside the `PushJobAgent`, as pilots are already able to manage parallel job execution properly.

At the moment, it is accepted as a forced limitation, but it should be addressed.
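The "one job downloads, the rest wait" behaviour described in the limitations above can be illustrated with a lock around a per-bundle cache. This is a simplified sketch, not the mechanism used in this PR: `BundleOutputCache` and its `download` callback are hypothetical, and it assumes all jobs of a bundle run as threads of one process (separate processes would need something like a file lock instead).

```python
import threading


class BundleOutputCache:
    """Hypothetical helper: download each bundle's outputs only once."""

    def __init__(self, download):
        self._download = download   # callable: bundle_id -> local output path
        self._lock = threading.Lock()
        self._paths = {}            # bundle_id -> local output path

    def get(self, bundle_id):
        # The first caller performs the download while holding the lock;
        # later callers block on the lock and then reuse the stored path.
        with self._lock:
            if bundle_id not in self._paths:
                self._paths[bundle_id] = self._download(bundle_id)
            return self._paths[bundle_id]
```

A single lock keeps the sketch short but also serialises downloads of unrelated bundles; a lock per bundle would avoid that at the cost of more bookkeeping.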
TODOs
`BundleCE`

BEGINRELEASENOTES
*Resources
NEW: BundleCE for bundled job submission.
NEW: AREXEnhancedCE for recursive job output retrieval.
FIX: Bug at AREXCE with executable name while constructing `wrapperContent`.

*WorkloadManagementSystem
NEW: Bundled Job submission using the new BundleDB, BundleHandler, BundleClient and BundleAgent components.
ENDRELEASENOTES