-
Notifications
You must be signed in to change notification settings - Fork 5
Fix callback server death on task overage #108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
If a caller attempts to run too many tasks on the callback server at once when run standalone, the JobRunner._watch loop will exit and the callback server will hang forever. This duplicates (for now) the job counting logic in the callback server itself so it doesn't submit jobs guaranteed to fail to the _watch loop. Eventually all this should likely be refactored. Also catches and reports bad module name errors.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #108 +/- ##
==========================================
+ Coverage 80.43% 80.55% +0.11%
==========================================
Files 13 13
Lines 1140 1147 +7
==========================================
+ Hits 917 924 +7
Misses 223 223 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
the uncovered stuff is again due to the callback server running in a different process |
|
codacy is complaining about asserts in tests again |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a critical issue where the callback server would hang forever when too many jobs were submitted, causing the JobRunner._watch loop to exit. The fix introduces job counting logic in the callback server to prevent submitting jobs that are guaranteed to fail.
Key changes:
- Adds configurable max_tasks parameter to prevent job submission when at capacity
- Improves error handling for bad module names and method parsing
- Enhances test coverage for job limits and error conditions
Reviewed Changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| JobRunner/callback_server.py | Implements job counting logic and improved error handling to prevent server hangs |
| JobRunner/config.py | Adds max_tasks parameter to Config class constructor |
| JobRunner/Callback.py | Passes max_tasks parameter through Callback class |
| JobRunner/JobRunner.py | Updates callback server initialization to include max_tasks |
| test/test_callback_server_integration.py | Adds comprehensive tests for job limits and error conditions |
| test/test_callback_server.py | Updates test configuration to include new jobcount and maxjobs parameters |
| pyproject.toml | Adds ipython development dependency |
| # the ip, which doesn't work if you're behind a NAT. | ||
| # All of this needs a rethink / refactor | ||
| s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM) | ||
| s.connect(("gmail.com", 80)) |
Copilot
AI
Aug 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using gmail.com as a target for socket connection is unreliable and creates an external dependency in tests. Consider using a more stable service or implementing a local network discovery method.
| s.connect(("gmail.com", 80)) | |
| # Use a non-routable IP address to determine local IP without external dependency | |
| s.connect(("10.255.255.255", 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If gmail.com isn't stable nothing is... also this is how it's done in other places in the codebase, it was just copied from there
| _check_rpc_token(app, token) | ||
| job_id = str(uuid.uuid1()) | ||
| data["method"] = "%s.%s" % (module, method[1:-7]) | ||
| job_id = str(uuid.uuid4()) |
Copilot
AI
Aug 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change from uuid.uuid1() to uuid.uuid4() should be consistent throughout the codebase. However, there's still a uuid.uuid1() call on line 93 in the test file that wasn't updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 93 of the test file is a switch from uuid1 to uuid4
| "error": exception_message, | ||
| "code": "123", | ||
| "message": exception_message, | ||
| "name": "CallbackServerError", |
Copilot
AI
Aug 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing space before the comma - should be 'CallbackServerError', not 'CallbackServerError', (note the trailing space).
| "name": "CallbackServerError", | |
| "name": "CallbackServerError", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed locally, it'll show up in a future PR
|
|
||
| def _restore_env(envvar, old_val): | ||
| if old_val is None: | ||
| os.environ.pop(envvar) |
Copilot
AI
Aug 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using os.environ.pop() without a default value will raise KeyError if the environment variable doesn't exist. This could cause test failures in certain environments.
| os.environ.pop(envvar) | |
| os.environ.pop(envvar, None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case it's guaranteed the env var is present.
If a caller attempts to run too many tasks on the callback server at once when run standalone, the JobRunner._watch loop will exit and the callback server will hang forever. This duplicates (for now) the job counting logic in the callback server itself so it doesn't submit jobs guaranteed to fail to the _watch loop. Eventually all this should likely be refactored.
Also catches and reports bad module name errors.