[DNM][Test CEDA NGINX] Reduce maxthreads#335

Draft
valeriupredoi wants to merge 6 commits into main from reduce_maxthreads

@valeriupredoi
Collaborator

@valeriupredoi valeriupredoi commented May 27, 2026

Mapping the CEDA NGINX threaded concurrency issue

See the GHA run https://github.com/NCAS-CMS/PyActiveStorage/actions/runs/26487146543 (the right-hand tab records reruns: click "Latest batch 4" to open the list of previous (re-)runs). The full run has 10 jobs issuing GETs on the same files on the NGINX store, with 100 max threads, i.e. up to 100 requests per file per job. This is, of course, theoretical: we only request 100 threads, and in practice a fair few fewer requests actually get sent.

100 maxthreads

(10-13min for pytest -n 2 tests/test_real_https.py)

  • Attempt 1: 10 concurrent jobs (1,000 requests/file): 6 fails
  • Attempt 2: 6 concurrent jobs (600 requests/file): 4 fails
  • Attempt 3: 4 concurrent jobs (400 requests/file): 2 fails
  • Attempt 4: 2 concurrent jobs (200 requests/file): 2 fails
  • Attempt 5: 20 concurrent jobs (2,000 requests/file): 18 fails

Reducing maxthreads to 10

(10-13min for pytest -n 2 tests/test_real_https.py)

  • Attempt 1: 12 concurrent jobs (120 requests/file): 0 fails
  • Attempt 2: 10 concurrent jobs (100 requests/file): 0 fails

Upping maxthreads to 20

(8-11min for pytest -n 2 tests/test_real_https.py)

  • Attempt 1: 12 concurrent jobs (240 requests/file): 0 fails

Upping maxthreads to 40

(8-9 min for pytest -n 2 tests/test_real_https.py)

  • Attempt 1: 12 concurrent jobs (480 requests/file): 1 fail

Upping maxthreads to 60

(9-11 min for pytest -n 2 tests/test_real_https.py)

  • Attempt 1: 12 concurrent jobs (720 requests/file): 6 fails
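For reference, the "requests/file" figures above are simply maxthreads multiplied by the number of concurrent jobs, so they are a theoretical upper bound rather than a measured count. A minimal sketch that tabulates the attempts listed above by that bound (the `runs` list is copied from the numbers in this comment; the `summarise` helper is a hypothetical name used only for illustration):

```python
# (max_threads, concurrent_jobs, failed_jobs), copied from the attempts listed above
runs = [
    (100, 10, 6),
    (100, 6, 4),
    (100, 4, 2),
    (100, 2, 2),
    (100, 20, 18),
    (10, 12, 0),
    (10, 10, 0),
    (20, 12, 0),
    (40, 12, 1),
    (60, 12, 6),
]


def summarise(runs):
    """Return rows sorted by the theoretical number of requests per file."""
    rows = []
    for max_threads, jobs, fails in runs:
        # Upper bound only: every job is assumed to use all of its threads at once
        requests_per_file = max_threads * jobs
        rows.append((requests_per_file, max_threads, jobs, fails, fails / jobs))
    return sorted(rows)


for requests_per_file, max_threads, jobs, fails, rate in summarise(runs):
    print(
        f"{requests_per_file:>5} req/file  maxthreads={max_threads:<3} "
        f"jobs={jobs:<2} fails={fails:<2} ({rate:.0%})"
    )
```

Sorted this way, the zero-failure runs top out at 240 requests/file (20 maxthreads, 12 jobs), while the 200 requests/file run at 100 maxthreads failed on both of its jobs, so the tally does not order cleanly by requests/file alone.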

The CEDA NGINX storage starts to show signs of failure at around 400 concurrent GETs per file, whereas the DKRZ storage is a lot more resilient and only starts to crumble at 2-3x that number of GETs per file. That said, using anything more than max_threads = 30 is probably overkill for a number of reasons: file chunking should be a lot coarser, there is no effective gain from Python threads beyond 30 or so, and, above all, we can always make max_threads an Active parameter.
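As an illustration of what a user-settable thread cap could look like, here is a minimal sketch using the standard library's `ThreadPoolExecutor`. This is not PyActiveStorage code: `fetch_chunks`, `fetch_one` and `ConcurrencyProbe` are hypothetical names, the default of 30 is just the figure suggested above, and the probe stands in for a real range-GET so that the cap can be checked without touching any storage.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_chunks(chunks, fetch_one, max_threads=30):
    """Apply fetch_one to every chunk with at most max_threads calls in flight."""
    # Never start more workers than there are chunks, and always at least one
    workers = max(1, min(max_threads, len(chunks)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the input chunks
        return list(pool.map(fetch_one, chunks))


class ConcurrencyProbe:
    """Stand-in for a real fetch: records the peak number of concurrent calls."""

    def __init__(self):
        self._lock = threading.Lock()
        self.in_flight = 0
        self.peak = 0

    def __call__(self, chunk):
        with self._lock:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
        time.sleep(0.01)  # pretend to wait on the network
        with self._lock:
            self.in_flight -= 1
        start, stop = chunk
        return stop - start  # number of bytes "fetched"


# 100 hypothetical 1 KiB chunks, expressed as (start, stop) byte offsets
chunks = [(i * 1024, (i + 1) * 1024) for i in range(100)]
probe = ConcurrencyProbe()
results = fetch_chunks(chunks, probe, max_threads=30)
print(f"fetched {len(results)} chunks, peak concurrency {probe.peak}")
```

The cap bounds the requests per file from one client only; the load the store sees is still roughly this cap multiplied by the number of concurrent clients.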

Varsiha's test script from #333 shows the same behaviour: things start to cook at around 50 threads, and at 100 things are completely broken; see https://github.com/NCAS-CMS/PyActiveStorage/actions/runs/26521834459

attn: @varsiha-sothilingam @bnlawrence

@valeriupredoi valeriupredoi added the testing testing duh label May 27, 2026
@codecov

codecov Bot commented May 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.13%. Comparing base (cb5b5c9) to head (6071973).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #335   +/-   ##
=======================================
  Coverage   93.13%   93.13%           
=======================================
  Files           7        7           
  Lines         685      685           
=======================================
  Hits          638      638           
  Misses         47       47           

@bnlawrence
Collaborator

I think we have to start being clearer about what is stressing what. PyActiveStorage threads are one thing, but from the nginx storage point of view, all it sees is standard range-gets from a client, albeit a lot at a time. We need to make that clear to the storage folks: while the ultimate client might be pyactivestorage, what they are seeing (and where their load balancing is failing) is legitimate range-gets (from reductionist), and this could well replicate what ESGF-NG life looks like, with or without reductionist.
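To make the point concrete for the storage folks, this is roughly what a single one of those requests looks like on the wire: a plain HTTP GET carrying a `Range` header with inclusive byte positions. A minimal sketch using only the standard library; the URL is a placeholder and `build_range_get` is a hypothetical helper, not something taken from reductionist or PyActiveStorage. Nothing is sent over the network here, the request is only constructed.

```python
from urllib.request import Request


def build_range_get(url, offset, size):
    """Build (but do not send) a GET for `size` bytes starting at `offset`."""
    if offset < 0 or size <= 0:
        raise ValueError("offset must be >= 0 and size must be > 0")
    # HTTP byte ranges are inclusive at both ends: bytes=first-last
    last = offset + size - 1
    return Request(url, headers={"Range": f"bytes={offset}-{last}"}, method="GET")


# Placeholder URL, for illustration only
req = build_range_get("https://example.org/data/file.nc", offset=0, size=1024)
print(req.get_method(), req.full_url, req.get_header("Range"))
```

A server that honours the range replies with `206 Partial Content`; any HTTP client that reads files in chunks produces this same pattern, whatever sits behind it.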

@valeriupredoi valeriupredoi mentioned this pull request May 28, 2026
7 tasks