Skip to content

Pretty Filestore#5484

Open
adamnovak wants to merge 12 commits intomasterfrom
issues/5401-pretty-filestore
Open

Pretty Filestore#5484
adamnovak wants to merge 12 commits intomasterfrom
issues/5401-pretty-filestore

Conversation

@adamnovak
Copy link
Copy Markdown
Member

@adamnovak adamnovak commented Apr 1, 2026

This will fix #5401.

It still needs the hints to actually be plugged in to the WDL/CWL runners.

The code here is mostly synthetic, generated with Anthropic Claude Opus 4.6. It looked right at first glance but I also caught it trying to sneak in race conditions, so it needs thorough review.

It also needs some manual testing to run a real WDL or CWL workflow and look at the file store and agree that's what we want.

Changelog Entry

To be copied to the draft changelog by merger:

  • Toil now organizes file store files using hints about e.g. workflow step names.

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passed tests, including the Gitlab tests, for the most recent commit in its branch.
  • Make sure the PR has been reviewed. If not, review it. If it has been reviewed and any requested changes seem to have been addressed, proceed.
  • Merge with the Github "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

Copy link
Copy Markdown
Member Author

@adamnovak adamnovak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the synthetic code suffers from bad style that ought to be fixed.

Comment on lines +325 to +326
may be inefficient; hints are intended for human-navigable
categorization, not for high-throughput bulk file creation.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like word salad of limited explanatory value.

Comment thread src/toil/jobStores/aws/jobStore.py Outdated

###################################### FILES API ######################################

def _claim_hinted_s3_slot(self, sanitized_hints: list[str], basename: str) -> str:
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything here is S3; does it need to be in the name?

Comment thread src/toil/jobStores/aws/jobStore.py Outdated
Comment on lines +463 to +464
writers on different nodes racing for the same hint+basename will
each land in a distinct numbered slot.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conditional writes aren't sufficient to provide this by themselves; all they do is avoid clobbering and fail a write instead.

Comment thread src/toil/jobStores/aws/jobStore.py Outdated

:param sanitized_hints: Already-sanitized hint path components.
:param basename: The file basename.
:returns: The file_id (e.g. ``lions/tigers/bears/0``).
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't use test case content as an example, because the reader will not be familiar with the test case.

Comment on lines +482 to +487
try:
n = int(parts[0])
if n >= next_n:
next_n = n + 1
except (ValueError, IndexError):
continue
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is accomplishing some important stuff around making sure that files with hint lists that are prefixes of each other work properly, which isn't explained in comments.

Comment thread src/toil/jobStores/abstractJobStore.py Outdated

# Characters allowed in sanitized hints: these survive quote() unchanged
# and are safe in filesystem paths and S3 keys.
HINT_SAFE_RE = re.compile(r"[^a-zA-Z0-9_\-.]")
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\ is not a great character to have in your paths.

Comment thread src/toil/jobStores/abstractJobStore.py Outdated
Comment on lines +1117 to +1120
# +1 for the separator between components
if total_length + len(sanitized) + (1 if result else 0) > self.MAX_HINTS_PATH_LENGTH:
break
total_length += len(sanitized) + (1 if result else 0)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fiddly and repetitive and the comment does not get a the real point here, which is to take hints until the separator and the next hint no longer fit under the limit, then stop.

Comment thread src/toil/jobStores/abstractJobStore.py Outdated
Comment on lines +1302 to +1305
Large numbers of files stored under the same non-empty hints
may be inefficient, as slot allocation scans existing entries;
hints are intended for human-navigable categorization, not for
high-throughput bulk file creation.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gets repeated a lot, so in addition to this part being of dubious conceptual coherence, it might benefit from the parameter description being shorter overall, with the concept of hints explained in more detail elsewhere.

Comment thread src/toil/jobStores/fileJobStore.py Outdated
concurrent writers on a shared filesystem will never collide.

After deletion, the empty numbered directory persists as a tombstone
that prevents the slot from being reallocated. This is critical
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Critical" is a strong word here. I'd say it's "required".

Comment thread src/toil/jobStores/fileJobStore.py Outdated
def _get_unique_file_path_with_hints(self, fileName, jobStoreID, cleanup, hints):
"""
Get a unique file path for a file stored under a human-readable
hint-derived directory hierarchy.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being "hint-derived" isn't a real concept.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make file store layout pretty using hints

1 participant