Merged
anyin233
commented
Jan 6, 2026
- Allow the submit API to start multiple replicas (for SwarmX)
- Add deletion of instances and offline workers
- Add graceful shutdown for workers (on receiving SIGINT, the worker notifies the head node to set its status to OFFLINE, then exits)
This commit adds comprehensive deletion support for PyLet instances
across all interfaces (HTTP API, Python API, and CLI).
Database Layer (db.py):
- Added delete_instance(instance_id) - delete by ID
- Added delete_instance_by_name(name) - delete by name
- Added delete_all_instances(status_filter) - bulk deletion
- Foreign key CASCADE automatically deletes allocations
Controller Layer (controller.py):
- Added delete_instance(instance_id)
- Added delete_instance_by_name(name)
- Added delete_all_instances(status_filter)
- Pokes scheduler after deletion to handle freed resources
HTTP API (server.py):
- DELETE /instances/{instance_id} - delete by ID (returns 204)
- DELETE /instances/by-name/{instance_name} - delete by name (returns 204)
- DELETE /instances?status=X - delete all with optional status filter
Python Sync API (_sync_api.py):
- pylet.delete(name) or pylet.delete(id=...)
- pylet.delete_all(status="COMPLETED")
Python Async API (aio/__init__.py):
- await pylet.aio.delete(name) or await pylet.aio.delete(id=...)
- await pylet.aio.delete_all(status="COMPLETED")
HTTP Client (client.py):
- client.delete_instance(instance_id)
- client.delete_instance_by_name(name)
- client.delete_all_instances(status)
CLI (cli.py):
- pylet delete --instance-id <id>
- pylet delete --name <name>
- pylet delete --all [--status COMPLETED]
- Includes confirmation prompts (--yes to skip)
Documentation (docs/instance-state-machine.md):
- Added comprehensive state machine visualization
- Documented all state transitions and lifecycle
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
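The cascade behavior described for the database layer can be illustrated with a minimal sketch. It assumes a SQLite-backed store; the table and column names below are hypothetical and are not taken from PyLet's actual schema.

```python
import sqlite3

# Hypothetical schema for illustration only; PyLet's real tables may differ.
SCHEMA = """
CREATE TABLE instances (
    id TEXT PRIMARY KEY,
    name TEXT UNIQUE,
    status TEXT NOT NULL
);
CREATE TABLE allocations (
    instance_id TEXT NOT NULL REFERENCES instances(id) ON DELETE CASCADE,
    gpu_index INTEGER NOT NULL
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys (and therefore CASCADE) when this
    # pragma is enabled on the connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def delete_all_instances(conn: sqlite3.Connection, status_filter=None) -> None:
    """Delete every instance, or only those matching status_filter."""
    if status_filter is None:
        conn.execute("DELETE FROM instances")
    else:
        conn.execute("DELETE FROM instances WHERE status = ?", (status_filter,))
    # Rows in allocations that reference deleted instances are removed by CASCADE.
    conn.commit()
```

With this layout the deletion code never touches the allocations table directly, which is the property the commit message relies on.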
Updated documentation to include the new instance deletion feature:
API Reference (docs/api_reference.md):
- Added pylet.delete(name, *, id) function documentation
- Added pylet.delete_all(*, status) function documentation
- Updated async API section with delete methods
- Updated API summary table with delete functions
CLI Reference (docs/cli_reference.md):
- Added comprehensive pylet delete command documentation
- Documented all deletion options (--instance-id, --name, --all, --status, --yes)
- Added safety features section (confirmation prompts)
- Added examples for single and bulk deletion
- Added best practices for safe deletion workflows
- Updated command summary table
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit adds comprehensive deletion support for workers with
safety constraints ensuring only OFFLINE workers can be deleted.
Database Layer (db.py):
- Added delete_worker(worker_id) - delete by ID
- Added delete_all_offline_workers() - bulk deletion
- Foreign key CASCADE automatically deletes GPU inventory
Controller Layer (controller.py):
- Added delete_worker(worker_id) with OFFLINE status check
Returns (success, error) tuple to differentiate "not_found" vs "online"
- Added delete_all_offline_workers() with in-memory state cleanup
- Cleans up desired_gen and gen_events for deleted workers
HTTP API (server.py):
- DELETE /workers/{worker_id} - Returns 204/404/400
400 error if worker is not OFFLINE
- DELETE /workers - Delete all OFFLINE workers
Python Sync API (_sync_api.py):
- pylet.delete_worker(worker_id)
Raises ValueError if worker is not OFFLINE
- pylet.delete_all_offline_workers()
Python Async API (aio/__init__.py):
- await pylet.aio.delete_worker(worker_id)
- await pylet.aio.delete_all_offline_workers()
HTTP Client (client.py):
- client.delete_worker(worker_id)
Returns False if not found, raises ValueError if not OFFLINE
- client.delete_all_offline_workers()
CLI (cli.py):
- pylet delete-worker --worker-id <id>
- pylet delete-worker --all-offline
- Confirmation prompts (--yes to skip)
- Clear error messages for non-OFFLINE workers
Documentation:
- Updated API reference with worker deletion functions
- Updated CLI reference with delete-worker command
- Added examples and safety notes
Safety Features:
- ONLY OFFLINE workers can be deleted
- ONLINE and SUSPECT workers are protected (400 error)
- Confirmation prompts in CLI
- Automatic cleanup of in-memory controller state
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
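The OFFLINE-only rule and the (success, error) return convention can be sketched as follows. This is an illustrative in-memory model, not the actual controller: the `WorkerRegistry` class and the `status_code_for` helper are hypothetical names, and only the "not_found"/"online" error strings and the 204/404/400 mapping come from the commit message.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WorkerRegistry:
    """Hypothetical stand-in for the controller's worker state."""

    # worker_id -> status ("ONLINE", "SUSPECT", or "OFFLINE")
    workers: dict[str, str] = field(default_factory=dict)

    def delete_worker(self, worker_id: str) -> tuple[bool, str | None]:
        status = self.workers.get(worker_id)
        if status is None:
            return False, "not_found"
        if status != "OFFLINE":
            # ONLINE and SUSPECT workers are both protected.
            return False, "online"
        del self.workers[worker_id]
        return True, None

    def delete_all_offline_workers(self) -> list[str]:
        offline = [wid for wid, st in self.workers.items() if st == "OFFLINE"]
        for wid in offline:
            del self.workers[wid]
        return offline


def status_code_for(result: tuple[bool, str | None]) -> int:
    """Map a delete_worker result to the HTTP status described above."""
    success, error = result
    if success:
        return 204
    return 404 if error == "not_found" else 400
```

Returning a tuple rather than a bare boolean lets the HTTP layer distinguish a missing worker from a protected one without raising exceptions inside the controller.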
Add optional delete parameter (default False) to cancel() method. When True, deletes the instance after cancellation is requested.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replica support:
- Add `replicas` parameter to submit instances (default 1)
- When replicas > 1, instances are named `{base_name}-{index}`
- API returns `instance_id` for single replica, `instance_ids` for multiple
- Update sync/async APIs, CLI, client, controller, and server
Port configuration:
- Add `--port` option to CLI for customizing head node API port and worker HTTP port
- Worker accepts `http_port` parameter for log retrieval server
Bug fix:
- Handle 503/404 responses gracefully when fetching logs from unavailable workers
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When workers receive SIGINT (Ctrl+C) or SIGTERM, they now:
1. Catch the signal and trigger graceful shutdown
2. Notify the head node via POST /workers/{id}/unregister endpoint
3. Head node immediately marks worker as OFFLINE
This enables faster failover compared to waiting for heartbeat timeout.
Changes:
- controller.py: Add unregister_worker() method
- server.py: Add POST /workers/{worker_id}/unregister endpoint
- worker.py: Add signal handlers and graceful shutdown logic
- test_controller.py: Fix test for auto-generated instance names
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
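The shutdown sequence above can be sketched with asyncio. This is a minimal illustration under stated assumptions, not the code in worker.py: `install_shutdown_handlers`, `worker_main`, and the `notify_head` callback are hypothetical names, and the callback stands in for the POST to the unregister endpoint.

```python
import asyncio
import signal


def install_shutdown_handlers(loop: asyncio.AbstractEventLoop,
                              shutdown_event: asyncio.Event) -> None:
    """Set shutdown_event on SIGINT or SIGTERM (Unix event loops only)."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, shutdown_event.set)


async def worker_main(shutdown_event: asyncio.Event, notify_head,
                      heartbeat_interval: float = 5.0) -> int:
    """Run a heartbeat loop until shutdown is requested, then notify the head.

    notify_head is an async callable; in a real worker it would send the
    unregister request so the head node can mark the worker OFFLINE.
    """
    beats = 0
    while not shutdown_event.is_set():
        beats += 1  # a real worker would send a heartbeat here
        try:
            # Sleep for one interval, but wake immediately on shutdown.
            await asyncio.wait_for(shutdown_event.wait(), timeout=heartbeat_interval)
        except asyncio.TimeoutError:
            pass
    await notify_head()
    return beats
```

Waiting on the event with a timeout, instead of a plain sleep, means the worker reacts to a signal right away rather than at the end of the current heartbeat interval.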
When no name is provided, use instance_id[:8] as the auto-generated name (original behavior) instead of a separate UUID. This ensures the name is derived from the actual instance ID for single-replica submissions.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
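The naming rule can be shown in a few lines. The helper name `resolve_name` is hypothetical, and the sketch assumes instance IDs are strings of at least eight characters (for example, UUID hex strings); the source does not state the ID format.

```python
from __future__ import annotations

import uuid


def resolve_name(instance_id: str, name: str | None = None) -> str:
    """Return the user-supplied name, or the first 8 characters of the ID."""
    return name if name else instance_id[:8]


# Example: the ID format here is an assumption for illustration.
instance_id = uuid.uuid4().hex
auto_name = resolve_name(instance_id)
```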
Update documentation to align with recent API changes:
- api_reference.md: Add full submit() signature with target_worker, gpu_indices, exclusive, labels, env, venv, and replicas parameters
- cli_reference.md: Add --port option to start command for both head and worker nodes
- cli_reference.md: Add --replicas, --target-worker, --gpu-indices, --exclusive, --label, --env, --venv options to submit command
Also includes:
- db.py: Fix foreign key constraint when deleting offline workers
- worker.py: Raise CancelledError after graceful shutdown notification
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Change heartbeat and monitor loops to check shutdown_event instead of while True
- Re-raise CancelledError in heartbeat loop when shutdown is triggered
- Add graceful instance termination before notifying head node
- Keep finally block to raise CancelledError for proper cleanup
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Workers now report their HTTP port during registration, which is stored in the database. When proxying log requests, the server uses the worker's registered port instead of hardcoding config.WORKER_HTTP_PORT. This fixes 503 errors when fetching logs from workers started with custom ports.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
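The port lookup described above might look roughly like this. The function name `worker_base_url`, the `host` and `http_port` keys, and the fallback argument are assumptions made for the sketch; the log endpoint path is deliberately left out because the source does not give it.

```python
def worker_base_url(worker: dict, fallback_port: int) -> str:
    """Build the base URL used to proxy requests to a worker.

    Prefers the port the worker reported at registration and falls back to
    a configured default only when no port was recorded.
    """
    port = worker.get("http_port") or fallback_port
    return f"http://{worker['host']}:{port}"
```

For a worker registered with port 16000, `worker_base_url({"host": "10.0.0.5", "http_port": 16000}, 9000)` yields `http://10.0.0.5:16000`.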
Add missing documentation for:
- pylet.instances() labels parameter for filtering
- Instance.cancel() delete parameter
- Instance properties: display_status, gpu_indices, exclusive, labels, env, target_worker
- WorkerInfo.gpu_indices_available property
- Async API differences (instances lacks labels, cancel has delete param)
- Updated API Summary table with Instance properties
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Instead of using static WORKER_PORT_MIN/MAX config, the instance port range is now calculated from the worker's HTTP port. When a worker starts with --port 16000, instances get ports 16001-16100. This allows multiple workers on the same host without port conflicts.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
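The derivation can be written as a small function. The name `instance_port_range` is hypothetical, and the span of 100 ports is inferred from the 16001-16100 example rather than confirmed as a configurable value.

```python
def instance_port_range(worker_http_port: int, span: int = 100) -> range:
    """Return the ports available to instances on a worker.

    The range starts one above the worker's own HTTP port and covers
    `span` consecutive ports.
    """
    first = worker_http_port + 1
    last = worker_http_port + span
    if worker_http_port <= 0 or last > 65535:
        raise ValueError("port range falls outside 1-65535")
    return range(first, last + 1)


ports = instance_port_range(16000)  # covers 16001 through 16100
```

Under this scheme, two workers on the same host avoid overlapping instance ports as long as their HTTP ports are more than `span` apart.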
BREAKING CHANGE: The `replicas` parameter has been removed from all submit functions. To create multiple instances, use a loop.
- Remove replicas parameter from controller, server, client, CLI
- Simplify return types (no more Union[str, List[str]])
- Update documentation with deprecation notice
- Add .ticktick-project to .gitignore
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
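A loop-based replacement for `replicas` could be sketched as below. The helper `submit_many` is hypothetical, and the `submit` callable is passed in because its exact signature is not shown here; the sketch assumes it accepts a command plus a `name` keyword and returns an instance ID. The `{base_name}-{index}` naming mirrors the earlier replica behavior, with a starting index of 0 chosen arbitrarily.

```python
from typing import Callable


def submit_many(submit: Callable[..., str], command: str,
                base_name: str, count: int) -> list[str]:
    """Submit `count` independent instances and return their IDs.

    Each instance is named "{base_name}-{index}", starting at index 0.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return [submit(command, name=f"{base_name}-{i}") for i in range(count)]
```

Keeping the loop in caller code means every submit call returns a single ID, which is what allows the simplified return types mentioned above.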
SecretSettler
requested changes
Feb 2, 2026
Could you please remove the replica feature and resolve the conflict? Maybe place replica as an example?
Contributor
Author
Replica support removed
Resolve conflicts in 10 files keeping feature branch functionality (deletion APIs, graceful shutdown, http_port, port range derivation) while accepting main's type-hint modernization, formatting, tooling (ruff/mypy config), and documentation additions. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Contributor
Author
Conflict resolved