Tension between Snakemake Data Model and that of Executors like HTCondor #98

@jhiemstrawisc

Description

I recently worked on a storage plugin to connect Snakemake with Pelican. Pelican is a software platform for creating data federations, and it has been tightly coupled with the HTCondor cluster scheduler. My ultimate goal in working on the storage plugin was to get Snakemake+HTCondor+Pelican playing nicely together according to HTCondor best practices.

This is roughly the data model I'm trying to achieve:
[Diagram: the intended data model]

However, I've come to understand that Snakemake's data model doesn't align with this. Rather than delegating input/output transfers to HTCondor, which knows how to deal with this format and can make decisions about things like access tokens, Snakemake always fetches the Pelican objects at the Access Point first and then has HTCondor transfer them as if they were local files (the ability to delegate these file transfers to HTCondor is something I introduced in #67). This makes the AP the very bottleneck Pelican is designed to remove.

Since this problem has a very similar feel to the one we solved in #67, I'm wondering whether there's an opportunity to solve a whole class of problems rather than knocking them out one by one.

It seems like the common theme here is that Snakemake has no semantics for delegating certain responsibilities, such as input/output transfers, to its executor plugins. In the HTCondor world view, HTCondor should handle as much of the I/O transfer work as possible, because it can make scheduling decisions about what should be transferred and when.

I can imagine a setting in which the HTCondor executor advertises a list of transfer protocols it understands (pelican://, osdf://, s3://, ftp://, etc.), so that whenever Snakemake encounters an input or output with one of those schemes, it knows to let HTCondor handle the transfer. A rough sketch of what that interaction could look like is below.
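To make the idea concrete, here's a minimal sketch. None of these names (`HTCondorExecutor`, `delegated_transfer_schemes`, `should_delegate`) exist in the current executor plugin interface; they're purely hypothetical, and the scheme list is illustrative. The point is just that the executor declares which schemes it will transfer itself, and Snakemake checks each URI against that set before deciding whether to run its own storage retrieval.

```python
# Hypothetical sketch only -- not part of any existing Snakemake API.
from urllib.parse import urlparse


class HTCondorExecutor:
    """Stand-in for an executor plugin that can stage some URIs itself."""

    def delegated_transfer_schemes(self) -> frozenset[str]:
        # Schemes this executor will hand to HTCondor's file transfer
        # mechanism instead of letting a Snakemake storage plugin fetch them.
        return frozenset({"pelican", "osdf", "s3", "ftp"})


def should_delegate(executor: HTCondorExecutor, uri: str) -> bool:
    """True if Snakemake should leave this URI untouched for the executor."""
    return urlparse(uri).scheme in executor.delegated_transfer_schemes()


# Usage sketch:
executor = HTCondorExecutor()
print(should_delegate(executor, "osdf://ospool/example/data.csv"))  # True
print(should_delegate(executor, "https://example.org/data.csv"))    # False
```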

One thing that's potentially tricky here is that different HTCondor clusters may support different schemes -- while most HTCondor clusters should support pelican://, not all will. This makes me hesitant to hardcode anything in the executor itself. Maybe the executor plugin interface could also let cluster administrators set cluster-wide defaults, so it's not up to users to figure out which schemes their pool supports? One way that could look is sketched after this paragraph.
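As a sketch of the admin-configured-defaults idea: the executor could read the supported schemes from a location the cluster administrator controls instead of baking them in. The file path, environment variable, and format below are assumptions, not an existing convention.

```python
# Hypothetical sketch: cluster-wide defaults for delegated transfer schemes.
import os
from pathlib import Path

# Conservative built-in fallback if the admin hasn't configured anything.
DEFAULT_SCHEMES = frozenset({"ftp"})


def cluster_transfer_schemes() -> frozenset[str]:
    """Schemes the local pool claims to support, as configured by the admin.

    The path and env var are made up for illustration; the real mechanism
    could just as well be an HTCondor config knob or an executor setting.
    """
    cfg = Path(
        os.environ.get(
            "SNAKEMAKE_HTCONDOR_SCHEMES_FILE",
            "/etc/snakemake/htcondor-transfer-schemes",
        )
    )
    if cfg.is_file():
        # One scheme per whitespace-separated token, e.g. "pelican osdf s3".
        return frozenset(s for s in cfg.read_text().split() if s)
    return DEFAULT_SCHEMES
```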

In the meantime, I plan to do a bit more research to see whether other schedulers have similar classes of problems, so that any solution that gets cooked up doesn't serve only the HTCondor integration.
