Fix aws s3 sync traverses excluded directory - takes up a long time #1138 #10273

Open
minjcho wants to merge 6 commits into aws:v2 from minjcho:fix-s3-filter-prune-v2

@minjcho commented May 4, 2026

Issue #, if available:

Fixes #1138, #1117

Description of changes:

The local file walker in s3 sync/cp/mv/rm now consults the user's --include/--exclude filter chain and prunes any directory whose descendants cannot possibly be included.
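For readers unfamiliar with directory pruning, here is a minimal sketch of the general technique using `os.walk`. It is not the code in this PR: `walk_with_prune` and the `can_prune` callback are hypothetical names, and the real walker in the s3 commands may be structured differently.

```python
import os
from typing import Callable, Iterator


def walk_with_prune(root: str, can_prune: Callable[[str], bool]) -> Iterator[str]:
    """Yield file paths under root, never descending into prunable directories.

    can_prune is a hypothetical callback that returns True when no descendant
    of the given directory could be included by the filter chain.
    """
    for dirpath, dirnames, filenames in os.walk(root, topdown=True):
        # With topdown=True, assigning to dirnames in place tells os.walk
        # which subdirectories to skip entirely.
        dirnames[:] = [
            d for d in dirnames if not can_prune(os.path.join(dirpath, d))
        ]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Because a pruned directory is never opened, the cost of an excluded subtree no longer depends on how many files it contains.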

Walking a 1M-file directory excluded via --exclude 'src/*' drops from ~70s to ~50ms, and FIFOs/sockets inside excluded subtrees no longer produce spurious warnings (rc=2 → rc=0).

| Filter pattern | 1M-file walk | Δ |
|---|---|---|
| --exclude excluded/* (canonical, prune fires) | 68s → 0.054s | 1,262× faster |
| --exclude '*' (prune at root) | 67s → <1ms | ~1.6M× faster |
| --exclude excluded/d[0-4]* (partial prune via char class) | 68s → 23s | 2.9× faster |
| --exclude excluded / excluded/ / excluded/d?00/* (filter cannot match anything, or pattern shape blocks prune) | 67s → 75s | ~12% slower (per-file probe, see below) |
| No --include/--exclude at all | 64s → 65s | unchanged (gating skips all new code) |

The 12% slowdown row is the cost of the per-file pre-filter that fixes #1117: a FIFO, socket, or mode-0o000 file inside an excluded subtree no longer produces a warning that raises rc to 2.

When the filter shape lets the algorithm prove a subtree is fully excluded, that cost is eliminated by the directory-level prune. When it cannot, the per-file probe pays for the #1117 fix.
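The per-file probe can be pictured roughly as follows. This is a sketch under stated assumptions, not the PR's code: `collect_files` and the `is_included` callback are hypothetical, the warning text is made up, and the example assumes a POSIX system where FIFOs and sockets exist. The point is only the ordering: the filter is consulted before the file is inspected, so an excluded special file never reaches the warning path.

```python
import os
import stat
from typing import Callable, Iterable, List, Tuple


def collect_files(
    paths: Iterable[str], is_included: Callable[[str], bool]
) -> Tuple[List[str], List[str]]:
    """Return (transferable files, warnings) for the given candidate paths."""
    files: List[str] = []
    warnings: List[str] = []
    for path in paths:
        # Probe the filter first: excluded paths are dropped silently,
        # before any check that could emit a warning.
        if not is_included(path):
            continue
        mode = os.lstat(path).st_mode
        if stat.S_ISFIFO(mode) or stat.S_ISSOCK(mode):
            warnings.append(f"skipping special file: {path}")
            continue
        files.append(path)
    return files, warnings
```

The extra filter evaluation per file is the overhead that shows up when no directory-level prune is possible.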

The "no filter" row confirms the gating in should_ignore_file short-circuits cleanly when there are no patterns to evaluate.
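As an illustration of that gating, here is a small sketch of a filter chain with an early return. It does not reproduce should_ignore_file; `is_included` and the `(kind, glob)` tuple layout are hypothetical, and matching against a path relative to the sync root is a simplification. The one behavior taken from the AWS CLI documentation is that files start out included and later filters take precedence over earlier ones.

```python
import fnmatch
from typing import List, Tuple

# Hypothetical representation: ("include" | "exclude", glob pattern).
Pattern = Tuple[str, str]


def is_included(rel_path: str, patterns: List[Pattern]) -> bool:
    """Evaluate an include/exclude chain where the last matching filter wins."""
    if not patterns:
        # Gate: with no filters there is nothing to evaluate.
        return True
    included = True  # files are included unless a filter says otherwise
    for kind, glob in patterns:
        if fnmatch.fnmatchcase(rel_path, glob):
            included = kind == "include"
    return included
```

With `[("exclude", "*"), ("include", "*.py")]`, a nested path such as `a/b.py` is still included, because `*` in `fnmatch` also matches `/`. That is the case any directory-level prune has to preserve.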

Two prior PRs in this area did not merge: #2105, abandoned in 2016, and #5425, which was closed after --exclude '*' --include '*.py' was shown to silently drop nested .py files (discussion_r883991435).

This PR's algorithm is sound: it prunes only when the filter chain provably cannot include any descendant, so PR #5425's regression case is handled automatically. The include <root>/*.py is detected as possibly matching descendants of <root>/foo/, blocking the prune.
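One conservative way to reason about such a soundness condition is sketched below. It is an assumption-laden illustration, not the helpers added to filters.py: all names are hypothetical, paths are relative to the sync root with `/` separators, and only excludes of the form `*` or `<literal>/*` are recognized as covering a subtree (so the character-class partial prune from the table is not modelled). An include blocks the prune unless its literal prefix proves it cannot match anything under the directory.

```python
from typing import List, Tuple

Pattern = Tuple[str, str]  # hypothetical: ("include" | "exclude", glob)
GLOB_CHARS = "*?["


def literal_prefix(glob: str) -> str:
    """Return the part of the glob before its first metacharacter."""
    for i, ch in enumerate(glob):
        if ch in GLOB_CHARS:
            return glob[:i]
    return glob


def covers_subtree(glob: str, directory: str) -> bool:
    """True only if the exclude glob provably matches every path under directory."""
    if glob == "*":
        return True
    if not glob.endswith("/*"):
        return False
    prefix = glob[:-1]  # keep the trailing "/"
    if any(ch in GLOB_CHARS for ch in prefix):
        return False  # too complex to prove; stay conservative
    return (directory + "/").startswith(prefix)


def may_match_descendant(glob: str, directory: str) -> bool:
    """False only if the include glob provably matches nothing under directory."""
    lit = literal_prefix(glob)
    subtree = directory + "/"
    return lit.startswith(subtree) or subtree.startswith(lit)


def can_prune(directory: str, patterns: List[Pattern]) -> bool:
    """Last-match-wins walk over the chain; prune only when exclusion is proven."""
    prunable = False
    for kind, glob in patterns:
        if kind == "exclude" and covers_subtree(glob, directory):
            prunable = True
        elif kind == "include" and may_match_descendant(glob, directory):
            prunable = False
    return prunable
```

Under these assumptions, `can_prune("foo", [("exclude", "*"), ("include", "*.py")])` is `False`, since `*.py` has an empty literal prefix and could match `foo/x.py`, while `can_prune("excluded", [("exclude", "excluded/*")])` is `True`. Any pattern shape the sketch cannot analyze falls back to not pruning, which costs time but never drops a file.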

filters.py gets three new module-private helpers and one new method on Filter; existing match logic is untouched.

Out of scope:

  • S3 server-side listing optimization
  • filters.py refactoring

Testing:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
