Skip to content

Conversation

@markphelps
Copy link
Contributor

Summary

Addresses review feedback from #2641 by replacing boolean "soup" with strongly-typed Rust enums across four key areas of the coglet codebase.

Changes

1. Predictor Method Characteristics (predictor.rs)

Before: 5 boolean fields

  • is_async, is_async_gen, has_train, is_train_async, is_standalone_function

After: PredictorKind enum with nested structure

enum PredictorKind {
    Class { predict: PredictKind, train: TrainKind },
    StandaloneFunction(PredictKind),
}

enum PredictKind { Sync, Async, AsyncGen }
enum TrainKind { None, Sync, Async }

Benefits: Invalid combinations like is_async=false + is_async_gen=true are now impossible at compile time.

2. Prediction Results (worker.rs)

Before: success: bool + error: Option<String> with string matching

if result.error.as_deref() == Some("Cancelled") { ... }

After: PredictionOutcome enum

enum PredictionOutcome {
    Success { output: Value, predict_time: f64 },
    Failed { error: String, predict_time: f64 },
    Cancelled { predict_time: f64 },
}

Benefits: Type-safe pattern matching, no string comparisons for cancellation detection.

3. Worker Bridge Mode (worker_bridge.rs)

Before: is_train: bool

After: HandlerMode enum

enum HandlerMode { Predict, Train }

Benefits: Explicit intent, clearer than boolean flag.

4. Slot State (worker_bridge.rs)

Before: 4 fields tracking state

  • cancelled: bool, in_progress: bool, is_async: bool, async_future: Option<Py<PyAny>>

After: SlotState enum with proper state machine

enum SlotState {
    Idle,
    SyncPrediction { cancelled: bool },
    AsyncPrediction { future: Py<PyAny>, cancelled: bool },
}

Benefits: State machine clarity, cancellation context preserved for choosing the right cancel mechanism (SIGUSR1 vs future.cancel()).

Impact

  • Type Safety: Invalid states are impossible at compile time
  • Readability: Self-documenting code with clear semantics
  • Maintainability: Easier to extend with new variants
  • No External Changes: Internal Rust refactoring only, no API impact

Testing

  • ✅ Compiles cleanly with no warnings (cargo check --all)
  • ✅ All existing functionality preserved
  • ✅ No changes to bridge protocol or Python API

Stats

  • 3 files changed
  • +385 insertions, -268 deletions

Replace boolean soup with strongly-typed enums across four areas:

- PredictorKind enum (Class/StandaloneFunction) replacing 5 boolean flags
  in predictor.rs (is_async, is_async_gen, has_train, is_train_async,
  is_standalone_function)

- PredictionOutcome enum (Success/Failed/Cancelled) replacing success bool
  and error string matching in worker.rs

- HandlerMode enum (Predict/Train) replacing is_train boolean in
  worker_bridge.rs

- SlotState enum (Idle/SyncPrediction/AsyncPrediction) replacing 4 fields
  with state machine in worker_bridge.rs

Benefits:
- Invalid states impossible at compile time
- Eliminates string matching for cancellation
- Clearer semantics and self-documenting code
- Easier to extend with new variants

Addresses PR #2641 review feedback.
@markphelps markphelps requested a review from a team as a code owner January 20, 2026 19:06
@markphelps markphelps changed the base branch from main to FFI January 20, 2026 19:07
Use derive(Default) with #[default] attribute instead of manual impl.
Replace direct field access (.success, .error) with pattern matching
on the PredictionOutcome enum.
Copy link

@mfainberg-cf mfainberg-cf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, one in-line comment on a if/else branch that seems superfluoyus

self.start_prediction(slot, is_async);
if is_async {
// Note: we'll update with the actual future later
self.start_sync_prediction(slot); // Temporary - will be replaced when we get the future

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the comment that we're replacing with the future, but is there a reason for the if/else if we call self.start_sync_prediction(slot) in either case?

Signed-off-by: Mark Phelps <mphelps@cloudflare.com>
@markphelps markphelps merged commit 004862b into FFI Jan 20, 2026
9 checks passed
@markphelps markphelps deleted the FFI-enums branch January 20, 2026 20:01
tempusfrangit pushed a commit that referenced this pull request Jan 21, 2026
* refactor: replace boolean flags with Rust enums

Replace boolean soup with strongly-typed enums across four areas:

- PredictorKind enum (Class/StandaloneFunction) replacing 5 boolean flags
  in predictor.rs (is_async, is_async_gen, has_train, is_train_async,
  is_standalone_function)

- PredictionOutcome enum (Success/Failed/Cancelled) replacing success bool
  and error string matching in worker.rs

- HandlerMode enum (Predict/Train) replacing is_train boolean in
  worker_bridge.rs

- SlotState enum (Idle/SyncPrediction/AsyncPrediction) replacing 4 fields
  with state machine in worker_bridge.rs

Benefits:
- Invalid states impossible at compile time
- Eliminates string matching for cancellation
- Clearer semantics and self-documenting code
- Easier to extend with new variants

Addresses PR #2641 review feedback.

* fix: apply clippy suggestions

Use derive(Default) with #[default] attribute instead of manual impl.

* style: apply rustfmt formatting

* test: update tests to use PredictionOutcome enum

Replace direct field access (.success, .error) with pattern matching
on the PredictionOutcome enum.

* chore: rm deadcode, simplify start_sync_prediction logic

Signed-off-by: Mark Phelps <mphelps@cloudflare.com>

---------

Signed-off-by: Mark Phelps <mphelps@cloudflare.com>
tempusfrangit pushed a commit that referenced this pull request Jan 21, 2026
* refactor: replace boolean flags with Rust enums

Replace boolean soup with strongly-typed enums across four areas:

- PredictorKind enum (Class/StandaloneFunction) replacing 5 boolean flags
  in predictor.rs (is_async, is_async_gen, has_train, is_train_async,
  is_standalone_function)

- PredictionOutcome enum (Success/Failed/Cancelled) replacing success bool
  and error string matching in worker.rs

- HandlerMode enum (Predict/Train) replacing is_train boolean in
  worker_bridge.rs

- SlotState enum (Idle/SyncPrediction/AsyncPrediction) replacing 4 fields
  with state machine in worker_bridge.rs

Benefits:
- Invalid states impossible at compile time
- Eliminates string matching for cancellation
- Clearer semantics and self-documenting code
- Easier to extend with new variants

Addresses PR #2641 review feedback.

* fix: apply clippy suggestions

Use derive(Default) with #[default] attribute instead of manual impl.

* style: apply rustfmt formatting

* test: update tests to use PredictionOutcome enum

Replace direct field access (.success, .error) with pattern matching
on the PredictionOutcome enum.

* chore: rm deadcode, simplify start_sync_prediction logic

Signed-off-by: Mark Phelps <mphelps@cloudflare.com>

---------

Signed-off-by: Mark Phelps <mphelps@cloudflare.com>
tempusfrangit added a commit that referenced this pull request Jan 23, 2026
* Ignore local agent and mise configuration files

* Add Rust tooling tasks to mise configuration

Tasks for coglet development:
- build:rust / build:rust:release - Build Rust crates
- build:coglet - Build Python wheel with maturin (dev)
- build:coglet:wheel* - Build release wheels for various platforms
- test:rust - Run Rust tests with nextest
- test:coglet-python - Run Python integration tests
- fmt:rust / fmt:rust:check - Format Rust code
- clippy - Run lints
- deny - Check licenses/advisories

Tools added: cargo-binstall, cargo-deny, cargo-insta, cargo-nextest,
maturin, ruff, ty

* Add Cargo workspace for Rust crates

Workspace structure:
- crates/Cargo.toml - workspace root with shared dependencies
- crates/coglet/ - main Rust library (empty scaffold)
- crates/deny.toml - cargo-deny license/advisory config
- crates/.gitignore - ignore build artifacts

Shared workspace dependencies defined for async runtime (tokio),
HTTP (axum, reqwest), serialization (serde), and error handling.

* Add coglet dependencies to Cargo.toml

Runtime: tokio, tokio-util (codec), futures, async-trait
Serialization: serde, serde_json
HTTP: axum (server), reqwest + ureq (webhooks)
Utils: uuid, chrono, thiserror, anyhow, dashmap, tracing
Unix: nix (signal handling)
Dev: insta, wiremock, tower, http-body-util

* Add health and version foundation types

Health enum: Unknown, Starting, Ready, Busy, SetupFailed, Defunct
SetupStatus enum with SetupResult for tracking setup phase
VersionInfo for runtime version reporting

Includes serde serialization with appropriate casing and tests.

* Add IPC bridge protocol and codec

Protocol types for parent-worker communication:
- ControlRequest/Response for control channel (cancel, shutdown, ready)
- SlotRequest/Response for per-slot data channel (predict, logs, output)
- SlotId using UUID for unique slot identification
- SlotOutcome for type-safe completion handling

JsonCodec wraps LengthDelimitedCodec with serde_json serialization.

* Add Unix socket transport for slot IPC

NamedSocketTransport: filesystem sockets (all Unix platforms)
AbstractSocketTransport: Linux abstract namespace (no filesystem)

Adds ControlRequest::Init variant for worker initialization with
transport info, predictor ref, slot count, and async/train flags.

* Add Prediction state tracking and PredictionOutput

PredictionStatus enum with terminal state detection.
PredictionOutput for single values or streamed chunks.
Prediction struct tracks lifecycle: status, logs, outputs, timing.
Completion notification via tokio Notify for async waiting.
CancellationToken re-exported from tokio_util for cancel propagation.

* Add predictor traits and prediction lifecycle types

PredictionResult aggregates output, timing, and logs.
PredictionGuard provides RAII timing with cancellation support.
PredictionMetrics for timing data collection.
PredictionError for typed prediction failures.
PredictFn/AsyncPredictFn type aliases for predictor function signatures.

Also fixes PredictionOutput to use untagged serde serialization.

* Add permit pool with typestate for slot management

PermitInUse/PermitIdle/PermitPoisoned enforce valid state transitions
at compile time - poisoned permits cannot become idle.

PermitPool manages slot permits with RAII semantics:
- Idle permits return to pool on drop
- Poisoned permits are orphaned, reducing capacity

AnyPermit enum for dynamic state storage.

* Add PredictionSlot combining Prediction and Permit

PredictionSlot holds both together with separate concerns:
- Prediction behind Mutex for concurrent updates
- Permit for RAII pool return on drop

Slot transitions: into_idle() returns permit, into_poisoned() orphans it.
Drop handler fails non-terminal predictions if slot leaked.

* Add webhook sender with throttling and retry

WebhookEventType enum for filtering (start, output, logs, completed).
WebhookConfig with throttle interval, retry settings, status codes.
TraceContext for W3C distributed tracing headers.

WebhookSender implements:
- send(): fire-and-forget for non-terminal events with throttling
- send_terminal(): async with exponential backoff retries
- send_terminal_sync(): blocking version for Drop contexts (uses ureq)

Bearer auth via WEBHOOK_AUTH_TOKEN env var.

* Add PredictionSupervisor for lifecycle management

PredictionState snapshot for API responses with to_response() builder.
PredictionHandle for waiting, state queries, and cancellation.
SyncPredictionGuard for cancel-on-drop behavior in sync predictions.

DashMap provides lock-free concurrent access - no deadlock risks.
Terminal status triggers webhook send via spawned task.

* Add Orchestrator for worker subprocess lifecycle

* Add PredictionService for transport-agnostic prediction lifecycle

* Add HTTP transport layer with axum server and routes

* Add worker subprocess types for PyO3 bindings

* Add coglet-python crate with PyO3 bindings

* Add Python input/output processing infrastructure

- input.rs: Runtime detection (Pydantic vs Coglet), URLPath download with
  parallel ThreadPoolExecutor, PreparedInput RAII for temp file cleanup
- output.rs: Output processing using cog.json.make_encodeable() and
  upload_files() for base64 data URL conversion
- audit.rs: TeeWriter for protecting sys.stdout/stderr with audit hooks
  that wrap user replacements while preserving slot routing
- log_writer.rs: Full SlotLogWriter with ContextVar-based prediction
  routing, line buffering, setup log sender for health-check logs
- cancel.rs: SIGUSR1 signal handling with CancelationException,
  CancelableGuard RAII for sync prediction cancellation

* Add full predictor and worker bridge implementation

- predictor.rs: Complete PythonPredictor with runtime detection,
  async/sync support, generator handling, input preparation,
  output processing, and schema generation
- worker_bridge.rs: Full PredictHandler with asyncio event loop,
  per-slot cancellation tracking, log context management,
  SIGUSR1 signal cancellation for sync, future.cancel() for async

* Add type stubs and Python tests

- coglet.pyi: Type stubs for active(), serve(), _run_worker(), _is_cancelable()
- tests/test_coglet.py: Integration tests for sync/async/generator predictors,
  health check, cancellation endpoints

* Add setup log hook and audit helper exports to coglet module

- Setup log hook for routing setup logs via SlotLogWriter
- Export audit helpers (_is_slot_log_writer, _is_tee_writer, etc.)
- Export SlotLogWriter and TeeWriter classes for isinstance checks

* Upgrade PyO3 from 0.24 to 0.27

- Use abi3-py38 feature for stable ABI compatibility
- Replace Python::with_gil with Python::attach (0.27 API)
- Replace py.allow_threads with py.detach (0.27 API)
- Add auto-initialize feature for standalone testing

* Install log writers and audit hook in worker startup

Add missing install_slot_log_writers() and install_audit_hook() calls
in _run_worker() to enable stdout/stderr routing and protection.

Remove module-level #![allow(dead_code)] since code is now integrated.

* Add SetupError type for typed setup failures

Replace Result<(), String> in PredictHandler::setup() with typed
SetupError enum. Variants distinguish load vs setup vs internal errors,
enabling future error code mapping without changing behavior.

* Add architecture documentation for coglet crates

- crates/README.md: Overall architecture, prediction flow, startup
  sequence, bridge protocol, directory structure
- crates/coglet/README.md: Detailed coglet internals, components,
  health states, behaviors (shutdown, cancellation, slot poisoning)
- crates/coglet-python/README.md: Python bindings details, active()
  flag, single async loop, stdout/stderr routing, audit hook, cancel

* Remove dead predict_fn code path from service

Production always uses orchestrator. The predict_fn/async_predict_fn
path was only used in tests and never ran in production.

- Remove PredictFn, AsyncPredictFn, PredictFuture types
- Remove predict_via_function method
- Remove legacy_pool field
- Remove PredictionService::new(pool) constructor
- Enforce READY requires orchestrator (silently ignores otherwise)
- Update tests to not create impossible states

* Extract Orchestrator trait for testable service layer

Introduce Orchestrator trait with register_prediction() method:
- Enables mocking orchestrator in service unit tests
- OrchestratorHandle implements trait, keeps cancel/shutdown as inherent methods
- Service now takes Arc<dyn Orchestrator> instead of Arc<OrchestratorHandle>
- Export trait from coglet crate for downstream use

* Add comprehensive tests using MockOrchestrator

- Add MockOrchestrator in both service and routes test modules
- Service tests: orchestrator integration, prediction flow, capacity
- HTTP routes tests: full request/response cycle with mock predictor
- Fix race in predict() by checking terminal before waiting on notify
- 15 new tests covering prediction lifecycle end-to-end

* Remove panics from permit/slot.rs state transitions

- into_idle() now returns Option<IdleToken> instead of panicking
- into_poisoned() now returns bool instead of panicking
- Invalid state transitions log errors and use debug_assert for dev builds
- Callers updated to handle new return types gracefully

* Remove panics from orchestrator.rs

- stdin/stdout take() now returns OrchestratorError instead of panicking
- Prediction mutex locks use try_lock_prediction() helper that:
  - On poison: fails the prediction, logs error, returns None
  - Caller removes poisoned predictions from tracking
  - Further data from poisoned predictions is ignored

* Remove panics from service.rs prediction locks

- Add try_lock_prediction() helper that fails prediction on mutex poison
- predict() returns PredictionError on mutex poison instead of panicking

* Remove panics from coglet-python setup and mutex locks

- worker_bridge: init_async_loop() and PythonPredictHandler::new() return Result
- input: detect_runtime() returns Result, PreparedInput uses Py<PyDict>
- log_writer: mutex locks use expect() with clear messages
- predictor: handle detect_runtime Result propagation

Worker mutex poisoning now panics with clear message - orchestrator
handles worker exit and fails predictions appropriately.

* Remove panics from webhook, routes, and server modules

- webhook: WebhookSender::new() returns Result for HTTP client creation
- webhook: Mutex locks use into_inner() to recover from poison
- routes: Document timestamp unwrap safety (can't fail after UNIX_EPOCH)
- server: Improve signal handler expect messages for clarity

Callers handle webhook creation failure by logging and continuing without
webhooks, which is acceptable graceful degradation.

* Replace remaining panics with PyErr in coglet-python

- input.rs: get_item().unwrap() -> ok_or_else(PyKeyError) for missing keys
- log_writer.rs: OnceLock.get().unwrap() -> ok_or_else(PyRuntimeError)

Both are now proper Python exceptions rather than Rust panics.

* Add Rust CI workflow and stub generation

- Add .github/workflows/rust.yaml for Rust CI (fmt, clippy, nextest)
- Add mise tasks: stubs:generate, stubs:check, stubs:typecheck
- Add mise tasks: ci:rust, ci:coglet for local CI simulation
- Add scripts/generate_stubs.py to auto-generate coglet.pyi from module
- Update coglet.pyi with all module exports (classes + internal functions)

CI runs:
- Pure Rust checks (fmt, clippy, tests) without Python
- coglet-python checks on Python 3.9/3.12/3.13 (build, stub check, typecheck)

* Apply cargo fmt formatting

* Fix Rust CI: add cargo-binstall, uv sync for Python

- Install cargo-binstall before mise so mise uses binstall for cargo tools
- Add uv sync to setup Python environment before maturin build
- Order: Rust -> cargo-binstall -> rust-cache -> mise -> uv sync

* Fix CI: resolve checkout and virtualenv issues

- Add #[allow(dead_code)] on AbstractSocketTransport.prefix (kept for debugging)
- Add 'uv venv' step before maturin build in CI
- Move virtualenv PATH setup to individual steps after checkout to prevent
  "Unable to locate executable file: tar" error during actions/checkout@v4

* Replace boolean flags with Rust enums (#2643)

* refactor: replace boolean flags with Rust enums

Replace boolean soup with strongly-typed enums across four areas:

- PredictorKind enum (Class/StandaloneFunction) replacing 5 boolean flags
  in predictor.rs (is_async, is_async_gen, has_train, is_train_async,
  is_standalone_function)

- PredictionOutcome enum (Success/Failed/Cancelled) replacing success bool
  and error string matching in worker.rs

- HandlerMode enum (Predict/Train) replacing is_train boolean in
  worker_bridge.rs

- SlotState enum (Idle/SyncPrediction/AsyncPrediction) replacing 4 fields
  with state machine in worker_bridge.rs

Benefits:
- Invalid states impossible at compile time
- Eliminates string matching for cancellation
- Clearer semantics and self-documenting code
- Easier to extend with new variants

Addresses PR #2641 review feedback.

* fix: apply clippy suggestions

Use derive(Default) with #[default] attribute instead of manual impl.

* style: apply rustfmt formatting

* test: update tests to use PredictionOutcome enum

Replace direct field access (.success, .error) with pattern matching
on the PredictionOutcome enum.

* chore: rm deadcode, simplify start_sync_prediction logic

Signed-off-by: Mark Phelps <mphelps@cloudflare.com>

---------

Signed-off-by: Mark Phelps <mphelps@cloudflare.com>

* Add Rust coglet integration to CI tests (#2644)

* cog-dataclass: fix Rust server predictor configuration

Fix http.py to use config.get_predictor_ref() instead of non-existent
config.predictor attribute. This properly loads the predictor reference
from cog.yaml for the Rust coglet server.

Changes:
- Use config.get_predictor_ref(args.mode) to get predictor ref
- Add helpful error handling with example if predictor not configured
- Exit cleanly with error code 1 instead of crashing

* Enable Rust coglet server in integration tests

Wire up the Rust coglet server (coglet-python) for integration testing
alongside both cog and cog-dataclass Python SDKs.

CI changes:
- Build Rust coglet wheel (ABI3) in build-python job
- Update test-integration matrix to 5 combinations:
  * cog (Python server)
  * cog-dataclass (Python server)
  * coglet-alpha (old Go server)
  * cog-rust (cog + Rust server)
  * cog-dataclass-rust (cog-dataclass + Rust server)
- Set COG_WHEEL and COGLET_RUST_WHEEL appropriately per runtime

Integration test harness:
- Propagate COGLET_RUST_WHEEL environment variable to tests
- Add conditions: cog_rust, cog_dataclass_rust, coglet_rust
- Allow tests to conditionally run based on server implementation

Dockerfile generation:
- Prevent installing Rust coglet when using coglet-alpha
- Add clear warning when both are specified
- Do not install Rust coglet for monobase/fast builds

Rust coglet:
- Add SdkImplementation enum (Pydantic, Dataclass, Unknown)
- Detect SDK type by checking for PYDANTIC_V2 attribute
- Log SDK implementation at startup for visibility

This allows us to test the Rust server with both SDK implementations
and ensure compatibility across the matrix.

* Clarify SDK detection: PYDANTIC_V2 exists in both v1 and v2

Add comment explaining that PYDANTIC_V2 attribute exists in both
pydantic v1 (set to False) and pydantic v2 (set to True), but does
not exist at all in dataclass-based cog.

This means the presence of the attribute (regardless of value)
indicates pydantic-based cog, while absence indicates dataclass-based.

* Exclude Rust coglet wheel from Go binary embedding

Only embed the Python coglet wheel (py3-none-any) in the Go binary.
The Rust coglet wheel (cp38-abi3-*) is distributed separately and
installed via COGLET_RUST_WHEEL environment variable at runtime.

This prevents the "expected exactly 1 coglet wheel embedded, found 2"
panic when both wheels are present in dist/.

* Fix type checking errors in cog http.py for Rust coglet integration

- Replace config.predictor with config.get_predictor_ref(args.mode)
- Add proper error handling for missing predictor configuration
- Add type: ignore comments for coglet attributes (conditional import)

This matches the fixes already applied to cog-dataclass and resolves
the pyright type checking failures in CI lint job.

* Remove embedded Python coglet wheel to avoid package name conflicts

The Python coglet wheel (py3-none-any) conflicts with the Rust coglet
wheel (cp38-abi3) as both have package name "coglet". Removed Python
coglet wheel completely:

- Remove Python coglet wheel build from CI
- Remove coglet wheel from Go binary embedding (only cog + cog-dataclass)
- Remove installEmbeddedCogletWheel() function
- Remove ReadCogletWheel() function
- Remove WheelSourceCogletEmbedded switch case (falls through to error)

Users should now use:
- COG_WHEEL=cog or cog-dataclass (Python servers)
- COG_WHEEL=coglet-alpha (Go implementation, downloaded from URL)
- COGLET_RUST_WHEEL=path (Rust server, installed alongside cog/cog-dataclass)

If someone sets COG_WHEEL=coglet, they'll get "unknown wheel source: coglet"
error instead of a panic.

* Remove coglet wheel test references

- Delete TestReadCogletWheel (function no longer exists)
- Delete TestCOGWheelEnvCoglet (embedded coglet removed)
- Delete TestCOGWheelCogletWithPython39Succeeds (embedded coglet removed)
- Remove WheelSourceCogletEmbedded from TestWheelSourceString
- Remove "coglet keyword" and "coglet uppercase" test cases
- Remove "env coglet overrides" test case

Tests now only cover: cog, cog-dataclass, coglet-alpha, file paths, and URLs.

* Fix Path handling for data URLs in cog-dataclass

- Use Path.validate() instead of Path() constructor to properly convert
  data URLs to URLPath objects
- Fix PredictionRequest.dict() to avoid asdict() serialization issues
  with URLPath objects

This fixes path_input and path_list_input integration tests where data
URLs were being passed as raw strings instead of being converted to
file paths.

* cog-dataclass: allow non-BasePredictor classes and add tests

- Accept predictor classes without BasePredictor inheritance
- Keep setup optional; validate if present
- Add unit tests for permissive create_predictor behavior
- Add integration tests for class/function predictors (skip coglet_alpha)

* coglet: detect non-pydantic runtime via cog-dataclass inspector

- Detect cog-dataclass predictors via cog._inspector.create_predictor
- Use non-pydantic input processor with URLPath downloads
- Generate schemas via cog-dataclass _schemas for non-pydantic runtime
- Remove legacy coglet runtime detection path

* cog-dataclass: accept local paths in Path.validate

Keep URL inputs as URLPath while restoring local path normalization.

* coglet: fix spurious error log during normal shutdown

Send ControlRequest::Shutdown to workers before exiting to avoid
"Control channel closed (parent died?)" error message during normal
shutdown.

Changes:
- Add shutdown() to Orchestrator trait
- Implement shutdown() in OrchestratorHandle trait impl
- Add PredictionService.shutdown() to call orchestrator.shutdown()
- Call service.shutdown() in HTTP server after graceful shutdown
- Update MockOrchestrator test doubles with no-op shutdown()

Before: Workers saw channel close and logged spurious error
After: Workers receive shutdown message and exit cleanly

* coglet: use manylinux2014 for Python 3.8+ compatibility

Change from auto-detected manylinux_2_39 to manylinux2014 (glibc 2.17)
to ensure the coglet wheel installs on Python 3.8 base images.

* Linting fixes

* Use rustls with native OS certificate stores

Configure both HTTP clients to use operating system certificate stores:

- reqwest: Use rustls-tls-native-roots feature (rustls-native-certs)
- ureq: Use rustls + platform-verifier with RootCerts::PlatformVerifier

This allows custom CA certificates to be supplied via the OS certificate
store (e.g., /etc/ssl/certs) instead of being hardcoded at compile time.

Trade-offs:
- Requires ca-certificates package in base images
- Explicit failure if certs missing (better than stale bundled certs)
- No diamond dependency issues (avoided native-tls)
- No unstable features (avoided rustls-no-provider)

Note: webpki-roots remains in dependency tree via ureq's rustls feature
but is not used at runtime due to explicit PlatformVerifier configuration.

* Allow CDLA-Permissive-2.0 license in cargo-deny

Add Community Data License Agreement Permissive 2.0 to the allow list
for webpki-roots certificate data. This is a permissive license compatible
with Apache-2.0, used for Mozilla's CA certificate bundle.

* Fix coglet-python test task to install dependencies

Use mise dir field and uv --with flags to install test dependencies
(pytest, requests) before running tests.

* CI: Add coglet-python tests and ensure rust+binstall before mise

- Add test-coglet-python job to run Python binding tests
- Install Rust toolchain and cargo-binstall before mise-action in all jobs
  to avoid compiling cargo tools from source (uses binstall instead)
- Ensures correct order: rust -> binstall -> mise -> rust-cache

* Fix coglet-python tests with dynamic port discovery and cog.yaml

- Add cog.yaml files to test fixtures (required by both cog and cog-dataclass)
- Implement dynamic port assignment using port 0 for tests
- Add background thread to capture stderr logs for port discovery
- Log actual bound port after TcpListener::bind() in coglet server
- Update mise task to install dependencies in proper venv
- All 12 coglet-python tests now passing

* Require Python 3.10+ for coglet (matching cog-dataclass)

- Update coglet-python requires-python from >=3.8 to >=3.10
- Change PyO3 ABI3 target from py38 to py310
- Update CI wheel paths from cp38-abi3 to cp310-abi3
- Matches cog-dataclass minimum Python version requirement

* Skip integration tests with Python <3.10 or monobase for coglet (Rust)

Add [coglet_rust] skip to integration tests that coglet cannot run:

Python <3.10 (9 tests):
- python37_deprecated.txtar (3.7)
- optional_input.txtar (3.8)
- setup_subprocess_multiprocessing.txtar (3.8)
- apt_packages.txtar (3.9)
- bad_dockerignore.txtar (3.9)
- torch_baseimage_fallback.txtar (3.9)
- torch_baseimage_no_cog_base.txtar (3.9)
- torch_baseimage_precompile.txtar (3.9)
- torch_cuda_baseimage.txtar (3.9)

Monobase/fast build (7 tests):
- build_base_image_sha.txtar
- build_cog_version_match.txtar
- build_localimage.txtar
- build_python313_base_image.txtar
- predict_localimage.txtar
- fast_build.txtar
- run_fast_build.txtar

coglet (Rust) requires Python >=3.10 and does not support monobase.

* CI: Use zig cross-compilation for Linux coglet wheel build

Use build:coglet:wheel:linux instead of build:coglet:wheel to ensure
manylinux2014 (GLIBC 2.17) compatibility. Building on modern Linux
systems picks up too-recent GLIBC symbols (2.18-2.39) which breaks
manylinux compliance. Zig cross-compilation targets the correct GLIBC version.

* fixup! CI: Add coglet-python tests and ensure rust+binstall before mise

Use astral-sh/setup-uv for proper venv management

* Fix Secret serialization and skip pydantic2_output for cog-dataclass

- Fix Secret.json_encode() to unwrap Secret objects before JSON serialization
  This fixes 'TypeError: Object of type Secret is not JSON serializable'
  when generating OpenAPI schemas with Secret input fields that have defaults
- Skip pydantic2_output test for cog-dataclass/coglet_rust since it explicitly
  tests Pydantic 2 BaseModel features

* Fix OpenAPI schema generation for training mode in coglet

- Make cog-dataclass to_json_schema() mode-aware:
  - Accept Mode parameter (PREDICT or TRAIN)
  - Generate appropriate routes (/predictions or /trainings)
  - Use correct schema names (PredictionRequest/Response or TrainingRequest/Response)

- Update coglet-python to pass HandlerMode through schema generation:
  - Pass mode from worker_bridge to predictor.schema()
  - Convert Rust HandlerMode enum to Python Mode enum
  - Pass mode to both pydantic (schema_via_fastapi) and cog-dataclass (to_json_schema)

This fixes the "missing output definition for /trainings" error when using
coglet with training endpoints, ensuring training has its own distinct
Input/Output schemas based on the train() method signature.

* Fix validation error messages to match pydantic format

- Change 'missing required input field' to 'Field required' in cog-dataclass
- Revert string_predictor.txtar test to expect 's: Field required'
- Fix mise.toml to use 'python -m pytest' to avoid hardcoded shebang issues

This ensures cog-dataclass validation errors match pydantic's format,
making the integration test pass.

* Fix subprocess output corrupting control channel in coglet

When Python's setup() spawns subprocesses (subprocess.Popen), they inherit
fd 1 (stdout) which is the control channel pipe to the orchestrator. Subprocess
output writes directly into the pipe, corrupting JSON framing with raw text.

This caused "frame size too big" errors when the bash subprocess in
setup_subprocess_simple.txtar wrote 100 lines during setup.

Solution: File descriptor redirection early in worker startup
- Move control channel to high-numbered fds (99-101)
- Replace fd 1/2 with capture pipes
- Spawn threads to read pipes and route output through log system
- Subprocess logs flow to coglet::user target (like orphan logs)
- Log forwarder runs for entire worker lifetime

Changes:
- Add crates/coglet/src/fd_redirect.rs with fd manipulation
- Update worker.rs to use redirected fds for control channel
- Add RUST_LOG propagation in CLI commands (serve, predict, run)
- Add RUST_LOG support in integration test harness
- Add frame size and log forwarding trace logging for debugging
- Route control channel logs to coglet::user target

Safety contracts documented for all unsafe blocks. Passes clippy with no warnings.
Fixes setup_subprocess_simple integration test.

* Fix validation error format to show simple field messages

Standardize validation error messages across cog-dataclass and coglet:
- cog-dataclass: Change "missing required input field" to "Field required"
- coglet: Parse Pydantic ValidationError to extract "field: message" format
- Both now output simple "field: message" instead of verbose error details

Fixes integration test string_predictor validation error matching.

* Fix TrainingOutput serialization for Pydantic v2 models

make_encodeable() now checks for model_dump() (Pydantic v2) before
falling back to dict() (Pydantic v1). This fixes train() functions
that return Pydantic BaseModel instances like TrainingOutput.

Fixes train_basic integration test.

* Fix validation error formatting in coglet to show simple messages

Parse Pydantic ValidationError to extract simple "field: message" format
instead of verbose error details. For ValueError (cog-dataclass), the
message is already in the correct format.

Includes clippy fixes for collapsible if statements.

Fixes string_predictor integration test validation error matching.

* Fix capture thread feedback loop and update to bounded channels

Capture thread improvements:
- Remove all tracing calls from capture threads to prevent feedback loop
  (stderr is redirected to capture pipe, so tracing would capture itself)
- Update comments to reflect bounded channel (500 messages) instead of unbounded
- Align safety comment with header documentation about tokio runtime threads

Log writer updates:
- Change SetupLogSender to use bounded Sender instead of UnboundedSender
- Use blocking_send() for backpressure handling

* Update cog-dataclass test to expect new validation error format

Change test to expect "Field required" instead of "missing required input field"
to match the updated error message format.

* Fix blocking panic and add dropped log tracking

Critical fixes:
1. Replace blocking_send() with try_send() in ControlChannelLogSender to prevent
   "Cannot block the current thread" panic when called from Python code on tokio
   runtime threads

2. Add DroppedLogs control message and periodic reporting (every 5s) to track
   logs dropped due to backpressure

3. Rename SetupLogSender -> ControlChannelLogSender with accurate documentation:
   - Lives for entire worker lifetime, not just setup
   - Handles both Python setup logs and subprocess output
   - Cleanup is no-op (sender persists)

4. Log dropped logs as warnings in orchestrator with human-readable intervals

Fixes integration test panics during setup phase.

* Ship worker tracing logs over IPC with preserved targets

Add custom tracing layer for worker subprocess that ships structured
tracing events over IPC to orchestrator, preserving component targets.

Changes:
- Add WorkerLog control message with target/level/message fields
- Implement WorkerTracingLayer that intercepts tracing events before formatting
- Initialize custom layer in worker with IPC channel sender
- Orchestrator re-emits logs with original target using low-level tracing API
- Optional fd 101 direct logging via RUST_WORKER_DIRECT_LOG=1 for debugging
- Increased channel depth 500 → 5000 for trace-level logging support

This replaces the previous approach where tracing went through capture pipes
with ANSI codes, losing component information and appearing as generic
'coglet::setup' source.

Note: Setuplog accumulation not yet implemented - logs are shipped and
re-emitted but not added to SetupResult. Will be addressed in follow-up.

* Prevent feedback loop in codec logging

Filter out coglet::bridge::codec logs from IPC shipping to prevent
infinite loop where encoding a WorkerLog message triggers codec trace
logging, which creates another WorkerLog, which triggers more encoding, etc.

Changes:
- WorkerTracingLayer skips shipping logs with codec target
- Added SAFETY comment in codec.rs explaining the feedback loop risk
- Codec logs still visible via RUST_WORKER_DIRECT_LOG=1 for debugging

* Accumulate coglet server setup logs via tracing layer

Adds SetupLogAccumulator tracing layer that captures all logs from the
coglet server process during the setup phase, from initial startup
through worker Ready.

Key implementation details:
- Tracing layer installed before any logs emitted (captures "coglet <version>" log)
- Accumulates orchestrator logs + re-emitted worker logs shipped over IPC
- Critical ordering: accumulate → Ready received → drain → flip Ready state
- Performance optimization: tx.is_closed() check prevents expensive ops after Ready
- Channel semantics: unbounded for setup phase, drainable even after close
- Log format: lines joined with \n separator, trailing newline at end

The accumulated logs are returned in SetupResult.logs and exposed via
/health-check endpoint for debugging setup failures.

* Ensure 'Server ready' log is accumulated before stopping

Keep the setup log receiver alive through the 'Server ready' log,
then do a final drain to capture it before setting the setup result.

The 'Setup complete, now accepting requests' log happens after the
receiver is dropped, so it correctly does NOT appear in setup logs.

Test verifies both behaviors:
- 'Server ready' IS in setup logs
- 'Setup complete' is NOT in setup logs

* Stabilize coglet bindings runtime detection

Split coglet bindings CI runs by runtime and align predictor ref
parsing with cog loader semantics.

- Parse predictor refs to use file stem instead of full dotted path
  (e.g., 'path/to/predict.py:Predictor' -> module='predict')
- Split CI matrix to test both cog-dataclass and cog runtimes separately
- Add mise tasks for testing each runtime independently
- Ensures proper module import behavior matching Python's loader

* Normalize coglet output serialization

Handle dataclass BaseModel outputs and unwrap pydantic iterators
before JSON encoding, plus rust-only setup log test guard.

- Add dataclass conversion in make_encodeable() for BaseModel outputs
- Unwrap pydantic serialization iterators in input processing and output serialization
- Guard setup_worker_tracing_logs.txtar test for rust-only coglet
- Add type safety check for numpy isinstance calls

---------

Signed-off-by: Mark Phelps <mphelps@cloudflare.com>
Co-authored-by: Mark Phelps <209477+markphelps@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants