Merged
2 changes: 1 addition & 1 deletion .gitignore
@@ -163,4 +163,4 @@ cython_debug/

# PyLet Test
*.err
*.log
*.log.ticktick-project
10 changes: 10 additions & 0 deletions .ticktick-project
@@ -0,0 +1,10 @@
# TickTick Project Configuration
project_id: 695d14978f085c53335c8476
project_name: SwarmPilot PyLet Integration
prefix: PYL
next_id: 3

# Task ID Registry (maps custom ID -> TickTick task ID)
tasks:
  PYL-001: 6964b9198f084eb6575c1fed
  PYL-002: 697ff9948f081d478584b6cd
225 changes: 216 additions & 9 deletions docs/api_reference.md
@@ -103,6 +103,12 @@ def submit(
    gpu: int = 0,
    cpu: int = 1,
    memory: int = 512,
    target_worker: Optional[str] = None,
    gpu_indices: Optional[List[int]] = None,
    exclusive: bool = True,
    labels: Optional[Dict[str, str]] = None,
    env: Optional[Dict[str, str]] = None,
    venv: Optional[str] = None,
) -> Instance
```

@@ -111,21 +117,51 @@ Submit a new instance.
**Args:**
- `command`: Shell command string, or list of args (auto shell-escaped)
- `name`: Optional instance name for service discovery
-- `gpu`: GPU units required (default 0)
+- `gpu`: GPU units required (default 0, ignored if `gpu_indices` specified)
- `cpu`: CPU cores required (default 1)
- `memory`: Memory in MB required (default 512)
- `target_worker`: Place on specific worker node
- `gpu_indices`: Request specific physical GPU indices
- `exclusive`: If `False`, GPUs don't block allocation pool (default `True`)
- `labels`: Custom metadata dict
- `env`: Environment variables to set
- `venv`: Path to pre-existing virtualenv (must be absolute path)

-**Returns:** `Instance` handle
+**Returns:**
+- `Instance` handle for the submitted instance.

**Raises:**
- `NotInitializedError`: `init()` not called
- `ValueError`: Invalid command or resources

**Example:**
```python
# Basic usage
instance = pylet.submit("echo hello", cpu=1)
instance = pylet.submit("vllm serve model --port $PORT", name="vllm", gpu=1, memory=4096)
instance = pylet.submit(["python", "-c", "print('hello')"], cpu=1)

# Target specific worker and GPU indices
instance = pylet.submit(
    "sllm-store start",
    target_worker="gpu-0",
    gpu_indices=[0, 1, 2, 3],
    exclusive=False,
    labels={"type": "sllm-store"},
)

# Use a virtualenv
instance = pylet.submit(
    "python train.py",
    venv="/home/user/my-venv",
    gpu=1,
)

# Deploy multiple instances (use a loop)
instances = []
for i in range(3):
    inst = pylet.submit("python worker.py", name=f"worker-{i}", gpu=1)
    instances.append(inst)
```
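
The argument rules above (`gpu` is ignored when `gpu_indices` is given, and `venv` must be an absolute path) can be sketched in plain Python. This is only an illustration of the documented rules, not PyLet's implementation: `resolve_gpu_request` and `check_venv` are hypothetical helper names, and both the "one GPU per requested index" count and the use of `ValueError` for a relative `venv` are assumptions.

```python
import os
from typing import List, Optional, Tuple


def resolve_gpu_request(
    gpu: int = 0, gpu_indices: Optional[List[int]] = None
) -> Tuple[int, Optional[List[int]]]:
    """Hypothetical helper: return (gpu_count, gpu_indices) after applying the rule."""
    if gpu_indices is not None:
        # Documented rule: `gpu` is ignored when `gpu_indices` is specified.
        # Assumption: the effective count is the number of requested indices.
        return len(gpu_indices), list(gpu_indices)
    return gpu, None


def check_venv(venv: Optional[str]) -> Optional[str]:
    """Hypothetical helper: enforce the documented absolute-path requirement."""
    if venv is not None and not os.path.isabs(venv):
        # Assumption: a relative path is rejected with ValueError.
        raise ValueError(f"venv must be an absolute path, got {venv!r}")
    return venv
```

Checking these conditions before calling `pylet.submit()` lets a caller fail fast on the client side, whatever the server does with the same input.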

---
@@ -167,13 +203,15 @@ instance = pylet.get(id="abc-123-def")
def instances(
    *,
    status: Optional[str] = None,
    labels: Optional[Dict[str, str]] = None,
) -> List[Instance]
```

List all instances.

**Args:**
- `status`: Filter by status (e.g., `"RUNNING"`, `"PENDING"`)
- `labels`: Filter by labels (all specified labels must match)

**Returns:** List of `Instance` handles

@@ -184,6 +222,7 @@ List all instances.
```python
all_instances = pylet.instances()
running = pylet.instances(status="RUNNING")
gpu_instances = pylet.instances(labels={"type": "gpu-worker"})
```
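
The `labels` filter is described as "all specified labels must match". A minimal sketch of that matching rule, assuming plain string equality on each key, is shown here; `labels_match` is a hypothetical name and the real comparison happens inside PyLet.

```python
from typing import Dict


def labels_match(instance_labels: Dict[str, str], wanted: Dict[str, str]) -> bool:
    """Return True only if every wanted key is present with the same value."""
    return all(instance_labels.get(key) == value for key, value in wanted.items())
```

Under this reading, extra labels on the instance do not prevent a match, and an empty filter matches every instance.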

---
@@ -203,6 +242,107 @@ List all registered workers.

---

### `pylet.delete`

```python
def delete(
    name: Optional[str] = None,
    *,
    id: Optional[str] = None,
) -> None
```

Delete an instance by name or ID.

**Args:**
- `name`: Instance name (positional or keyword)
- `id`: Instance ID (keyword only)

**Raises:**
- `NotInitializedError`: `init()` not called
- `NotFoundError`: Instance not found
- `ValueError`: Neither `name` nor `id` provided

**Example:**
```python
pylet.delete("my-instance")
pylet.delete(id="abc-123-def")
```

---

### `pylet.delete_all`

```python
def delete_all(*, status: Optional[str] = None) -> int
```

Delete all instances, optionally filtered by status.

**Args:**
- `status`: Only delete instances with this status (e.g., `"COMPLETED"`, `"FAILED"`, `"CANCELLED"`)

**Returns:** Number of instances deleted

**Raises:**
- `NotInitializedError`: `init()` not called

**Example:**
```python
# Delete all completed instances
count = pylet.delete_all(status="COMPLETED")
print(f"Deleted {count} instances")

# Delete all instances (use with caution!)
count = pylet.delete_all()
```
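
Because `delete_all` takes a single `status`, cleaning up several statuses means one call per status. The sketch below sums the returned counts; `cleanup_terminal` is a hypothetical helper, and the function to call is passed in so the snippet does not depend on a running cluster.

```python
from typing import Callable

# Statuses named in the `status` argument description above.
TERMINAL_STATUSES = ("COMPLETED", "FAILED", "CANCELLED")


def cleanup_terminal(delete_all: Callable[..., int]) -> int:
    """Call delete_all once per terminal status and return the total deleted."""
    return sum(delete_all(status=status) for status in TERMINAL_STATUSES)
```

After `pylet.init()`, this would be invoked as `cleanup_terminal(pylet.delete_all)`.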

---

### `pylet.delete_worker`

```python
def delete_worker(worker_id: str) -> None
```

Delete a worker by ID. **Only OFFLINE workers can be deleted.**

**Args:**
- `worker_id`: Worker ID to delete

**Raises:**
- `NotInitializedError`: `init()` not called
- `NotFoundError`: Worker not found
- `ValueError`: Worker is not OFFLINE (only OFFLINE workers can be deleted)

**Example:**
```python
pylet.delete_worker("worker-123")
```

---

### `pylet.delete_all_offline_workers`

```python
def delete_all_offline_workers() -> int
```

Delete all workers with OFFLINE status.

**Returns:** Number of workers deleted

**Raises:**
- `NotInitializedError`: `init()` not called

**Example:**
```python
count = pylet.delete_all_offline_workers()
print(f"Deleted {count} offline workers")
```
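
The OFFLINE-only rule for worker deletion can be modelled with an in-memory registry. This is a sketch of the documented behaviour, not PyLet's code: the registry dict, the local `NotFoundError` stand-in, and the `"ONLINE"` status string used in the usage note are all assumptions made for illustration.

```python
from typing import Dict


class NotFoundError(Exception):
    """Local stand-in for the error type named in this reference."""


def delete_worker(registry: Dict[str, str], worker_id: str) -> None:
    """Remove worker_id from registry (worker_id -> status) if it is OFFLINE."""
    if worker_id not in registry:
        raise NotFoundError(worker_id)
    if registry[worker_id] != "OFFLINE":
        raise ValueError(f"worker {worker_id} is {registry[worker_id]}, not OFFLINE")
    del registry[worker_id]


def delete_all_offline_workers(registry: Dict[str, str]) -> int:
    """Remove every OFFLINE worker and return how many were removed."""
    offline = [wid for wid, status in registry.items() if status == "OFFLINE"]
    for wid in offline:
        del registry[wid]
    return len(offline)
```

With a registry such as `{"w1": "OFFLINE", "w2": "ONLINE"}`, deleting `"w2"` raises `ValueError`, while the bulk call removes only `"w1"` and returns 1.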

---

## Class: `Instance`

Returned by `pylet.submit()` and `pylet.get()`. Represents a handle to an instance.
@@ -239,6 +379,42 @@ def exit_code(self) -> Optional[int]
```
Process exit code when terminal, `None` otherwise.

```python
@property
def display_status(self) -> str
```
User-facing status. Returns `"CANCELLING"` while cancellation is in progress, otherwise same as `status`.
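
One way to read this rule is sketched below. The reference does not say how "cancellation in progress" is tracked, so the `cancel_requested` flag and the check against terminal statuses are assumptions, and the free function is a hypothetical stand-in for the property.

```python
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED"}


def display_status(status: str, cancel_requested: bool) -> str:
    """Return "CANCELLING" while a requested cancellation has not finished."""
    # Assumption: cancellation is "in progress" once requested and until
    # the instance reaches a terminal status.
    if cancel_requested and status not in TERMINAL:
        return "CANCELLING"
    return status
```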

```python
@property
def gpu_indices(self) -> Optional[List[int]]
```
Allocated GPU indices when assigned/running, `None` otherwise.

```python
@property
def exclusive(self) -> bool
```
Whether instance has exclusive GPU access. Default `True`.

```python
@property
def labels(self) -> Dict[str, str]
```
User-defined labels. Returns empty dict if none set.

```python
@property
def env(self) -> Dict[str, str]
```
User-defined environment variables. Returns empty dict if none set.

```python
@property
def target_worker(self) -> Optional[str]
```
Target worker constraint if set, `None` otherwise.

### Methods

#### `Instance.wait_running`
@@ -277,11 +453,14 @@ Block until instance reaches terminal state (`COMPLETED`, `FAILED`, `CANCELLED`)
#### `Instance.cancel`

```python
-def cancel(self) -> None
+def cancel(self, delete: bool = False) -> None
```

Request instance cancellation. Returns immediately (cancellation is async).

**Args:**
- `delete`: If `True`, delete the instance after cancellation completes (default `False`)

**Raises:**
- `InstanceTerminatedError`: Instance already in terminal state

@@ -382,6 +561,12 @@ def memory_available(self) -> int
```
Available memory in MB.

```python
@property
def gpu_indices_available(self) -> List[int]
```
List of available GPU indices.
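
A caller could use this list to choose a `target_worker` before requesting specific `gpu_indices`. The sketch below assumes the caller has already built a mapping from worker ID to its available indices (for example from `pylet.workers()`); `workers_with_gpus` is a hypothetical helper, and how PyLet's own scheduler makes this decision is not described here.

```python
from typing import Dict, Iterable, List


def workers_with_gpus(
    available: Dict[str, List[int]], requested: Iterable[int]
) -> List[str]:
    """Return IDs of workers whose free GPU indices cover every requested index."""
    needed = set(requested)
    return [wid for wid, free in available.items() if needed <= set(free)]
```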

---

## Cluster Management
@@ -571,16 +756,20 @@ async def main():
- `async pylet.aio.init(address: str = "http://localhost:8000") -> None`
- `async pylet.aio.shutdown() -> None`
- `pylet.aio.is_initialized() -> bool` (sync, no I/O)
-- `async pylet.aio.submit(...) -> Instance`
+- `async pylet.aio.submit(...) -> Instance` - Same parameters as sync version
- `async pylet.aio.get(...) -> Instance`
-- `async pylet.aio.instances(...) -> List[Instance]`
+- `async pylet.aio.instances(*, status: Optional[str] = None) -> List[Instance]` - Note: does not support `labels` parameter
- `async pylet.aio.workers() -> List[WorkerInfo]`
- `async pylet.aio.delete(name=None, *, id=None) -> None`
- `async pylet.aio.delete_all(*, status=None) -> int`
- `async pylet.aio.delete_worker(worker_id) -> None`
- `async pylet.aio.delete_all_offline_workers() -> int`
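
Since the async `instances()` is noted as not supporting `labels`, label filtering has to happen on the client. A minimal sketch, assuming each returned object exposes the `labels` property documented for `Instance`, is shown below; `instances_with_labels` is a hypothetical helper, and the listing coroutine is passed in so the snippet runs without a cluster.

```python
from typing import Any, Awaitable, Callable, Dict, List, Optional


async def instances_with_labels(
    list_instances: Callable[..., Awaitable[List[Any]]],
    wanted: Dict[str, str],
    status: Optional[str] = None,
) -> List[Any]:
    """Fetch instances, then keep those whose labels contain every wanted pair."""
    found = await list_instances(status=status)
    return [
        inst
        for inst in found
        if all(inst.labels.get(key) == value for key, value in wanted.items())
    ]
```

Inside an async function this would be called as `await instances_with_labels(pylet.aio.instances, {"type": "gpu-worker"}, status="RUNNING")`.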

### Async Instance Methods

- `async Instance.wait_running(timeout: float = 300) -> None`
- `async Instance.wait(timeout: Optional[float] = None) -> None`
-- `async Instance.cancel() -> None`
+- `async Instance.cancel(delete: bool = False) -> None`
- `async Instance.logs(tail: Optional[int] = None) -> str`
- `async Instance.refresh() -> None`

@@ -657,17 +846,35 @@ with pylet.local_cluster(workers=2, cpu_per_worker=2) as cluster:
| `pylet.init(address)` | Connect to head |
| `pylet.shutdown()` | Disconnect (optional) |
| `pylet.is_initialized()` | Check if connected |
-| `pylet.submit(command, *, name, gpu, cpu, memory)` | Submit instance |
+| `pylet.submit(command, *, name, gpu, cpu, memory, ...)` | Submit instance |
| `pylet.get(name, *, id)` | Get instance |
-| `pylet.instances(*, status)` | List instances |
+| `pylet.instances(*, status, labels)` | List instances |
| `pylet.workers()` | List workers |
| `pylet.delete(name, *, id)` | Delete instance |
| `pylet.delete_all(*, status)` | Delete all instances |
| `pylet.delete_worker(worker_id)` | Delete OFFLINE worker |
| `pylet.delete_all_offline_workers()` | Delete all OFFLINE workers |
| `pylet.start(*, address, port, gpu, cpu, memory, block)` | Start head/worker |
| `pylet.local_cluster(workers, *, ...)` | Test cluster |

| Instance Property | Purpose |
|-------------------|---------|
| `instance.id` | Instance UUID |
| `instance.name` | User-provided name |
| `instance.status` | Current status |
| `instance.display_status` | User-facing status (shows CANCELLING) |
| `instance.endpoint` | host:port when running |
| `instance.exit_code` | Exit code when terminal |
| `instance.gpu_indices` | Allocated GPU indices |
| `instance.exclusive` | Exclusive GPU access |
| `instance.labels` | User-defined labels |
| `instance.env` | Environment variables |
| `instance.target_worker` | Target worker constraint |

| Instance Method | Purpose |
|-----------------|---------|
| `instance.wait_running(timeout)` | Block until RUNNING |
| `instance.wait(timeout)` | Block until terminal |
-| `instance.cancel()` | Request cancellation |
+| `instance.cancel(delete)` | Request cancellation |
| `instance.logs(tail)` | Get logs |
| `instance.refresh()` | Update from server |