-
Notifications
You must be signed in to change notification settings - Fork 654
Replace boolean flags with Rust enums #2643
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Replace boolean soup with strongly-typed enums across four areas: - PredictorKind enum (Class/StandaloneFunction) replacing 5 boolean flags in predictor.rs (is_async, is_async_gen, has_train, is_train_async, is_standalone_function) - PredictionOutcome enum (Success/Failed/Cancelled) replacing success bool and error string matching in worker.rs - HandlerMode enum (Predict/Train) replacing is_train boolean in worker_bridge.rs - SlotState enum (Idle/SyncPrediction/AsyncPrediction) replacing 4 fields with state machine in worker_bridge.rs Benefits: - Invalid states impossible at compile time - Eliminates string matching for cancellation - Clearer semantics and self-documenting code - Easier to extend with new variants Addresses PR #2641 review feedback.
Use derive(Default) with #[default] attribute instead of manual impl.
Replace direct field access (.success, .error) with pattern matching on the PredictionOutcome enum.
mfainberg-cf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, one in-line comment on a if/else branch that seems superfluoyus
| self.start_prediction(slot, is_async); | ||
| if is_async { | ||
| // Note: we'll update with the actual future later | ||
| self.start_sync_prediction(slot); // Temporary - will be replaced when we get the future |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the comment that we're replacing with the future, but is there a reason for the if/else if we call self.start_sync_prediction(slot) in either case?
Signed-off-by: Mark Phelps <mphelps@cloudflare.com>
* refactor: replace boolean flags with Rust enums Replace boolean soup with strongly-typed enums across four areas: - PredictorKind enum (Class/StandaloneFunction) replacing 5 boolean flags in predictor.rs (is_async, is_async_gen, has_train, is_train_async, is_standalone_function) - PredictionOutcome enum (Success/Failed/Cancelled) replacing success bool and error string matching in worker.rs - HandlerMode enum (Predict/Train) replacing is_train boolean in worker_bridge.rs - SlotState enum (Idle/SyncPrediction/AsyncPrediction) replacing 4 fields with state machine in worker_bridge.rs Benefits: - Invalid states impossible at compile time - Eliminates string matching for cancellation - Clearer semantics and self-documenting code - Easier to extend with new variants Addresses PR #2641 review feedback. * fix: apply clippy suggestions Use derive(Default) with #[default] attribute instead of manual impl. * style: apply rustfmt formatting * test: update tests to use PredictionOutcome enum Replace direct field access (.success, .error) with pattern matching on the PredictionOutcome enum. * chore: rm deadcode, simplify start_sync_prediction logic Signed-off-by: Mark Phelps <mphelps@cloudflare.com> --------- Signed-off-by: Mark Phelps <mphelps@cloudflare.com>
* refactor: replace boolean flags with Rust enums Replace boolean soup with strongly-typed enums across four areas: - PredictorKind enum (Class/StandaloneFunction) replacing 5 boolean flags in predictor.rs (is_async, is_async_gen, has_train, is_train_async, is_standalone_function) - PredictionOutcome enum (Success/Failed/Cancelled) replacing success bool and error string matching in worker.rs - HandlerMode enum (Predict/Train) replacing is_train boolean in worker_bridge.rs - SlotState enum (Idle/SyncPrediction/AsyncPrediction) replacing 4 fields with state machine in worker_bridge.rs Benefits: - Invalid states impossible at compile time - Eliminates string matching for cancellation - Clearer semantics and self-documenting code - Easier to extend with new variants Addresses PR #2641 review feedback. * fix: apply clippy suggestions Use derive(Default) with #[default] attribute instead of manual impl. * style: apply rustfmt formatting * test: update tests to use PredictionOutcome enum Replace direct field access (.success, .error) with pattern matching on the PredictionOutcome enum. * chore: rm deadcode, simplify start_sync_prediction logic Signed-off-by: Mark Phelps <mphelps@cloudflare.com> --------- Signed-off-by: Mark Phelps <mphelps@cloudflare.com>
* Ignore local agent and mise configuration files * Add Rust tooling tasks to mise configuration Tasks for coglet development: - build:rust / build:rust:release - Build Rust crates - build:coglet - Build Python wheel with maturin (dev) - build:coglet:wheel* - Build release wheels for various platforms - test:rust - Run Rust tests with nextest - test:coglet-python - Run Python integration tests - fmt:rust / fmt:rust:check - Format Rust code - clippy - Run lints - deny - Check licenses/advisories Tools added: cargo-binstall, cargo-deny, cargo-insta, cargo-nextest, maturin, ruff, ty * Add Cargo workspace for Rust crates Workspace structure: - crates/Cargo.toml - workspace root with shared dependencies - crates/coglet/ - main Rust library (empty scaffold) - crates/deny.toml - cargo-deny license/advisory config - crates/.gitignore - ignore build artifacts Shared workspace dependencies defined for async runtime (tokio), HTTP (axum, reqwest), serialization (serde), and error handling. * Add coglet dependencies to Cargo.toml Runtime: tokio, tokio-util (codec), futures, async-trait Serialization: serde, serde_json HTTP: axum (server), reqwest + ureq (webhooks) Utils: uuid, chrono, thiserror, anyhow, dashmap, tracing Unix: nix (signal handling) Dev: insta, wiremock, tower, http-body-util * Add health and version foundation types Health enum: Unknown, Starting, Ready, Busy, SetupFailed, Defunct SetupStatus enum with SetupResult for tracking setup phase VersionInfo for runtime version reporting Includes serde serialization with appropriate casing and tests. * Add IPC bridge protocol and codec Protocol types for parent-worker communication: - ControlRequest/Response for control channel (cancel, shutdown, ready) - SlotRequest/Response for per-slot data channel (predict, logs, output) - SlotId using UUID for unique slot identification - SlotOutcome for type-safe completion handling JsonCodec wraps LengthDelimitedCodec with serde_json serialization. * Add Unix socket transport for slot IPC NamedSocketTransport: filesystem sockets (all Unix platforms) AbstractSocketTransport: Linux abstract namespace (no filesystem) Adds ControlRequest::Init variant for worker initialization with transport info, predictor ref, slot count, and async/train flags. * Add Prediction state tracking and PredictionOutput PredictionStatus enum with terminal state detection. PredictionOutput for single values or streamed chunks. Prediction struct tracks lifecycle: status, logs, outputs, timing. Completion notification via tokio Notify for async waiting. CancellationToken re-exported from tokio_util for cancel propagation. * Add predictor traits and prediction lifecycle types PredictionResult aggregates output, timing, and logs. PredictionGuard provides RAII timing with cancellation support. PredictionMetrics for timing data collection. PredictionError for typed prediction failures. PredictFn/AsyncPredictFn type aliases for predictor function signatures. Also fixes PredictionOutput to use untagged serde serialization. * Add permit pool with typestate for slot management PermitInUse/PermitIdle/PermitPoisoned enforce valid state transitions at compile time - poisoned permits cannot become idle. PermitPool manages slot permits with RAII semantics: - Idle permits return to pool on drop - Poisoned permits are orphaned, reducing capacity AnyPermit enum for dynamic state storage. * Add PredictionSlot combining Prediction and Permit PredictionSlot holds both together with separate concerns: - Prediction behind Mutex for concurrent updates - Permit for RAII pool return on drop Slot transitions: into_idle() returns permit, into_poisoned() orphans it. Drop handler fails non-terminal predictions if slot leaked. * Add webhook sender with throttling and retry WebhookEventType enum for filtering (start, output, logs, completed). WebhookConfig with throttle interval, retry settings, status codes. TraceContext for W3C distributed tracing headers. WebhookSender implements: - send(): fire-and-forget for non-terminal events with throttling - send_terminal(): async with exponential backoff retries - send_terminal_sync(): blocking version for Drop contexts (uses ureq) Bearer auth via WEBHOOK_AUTH_TOKEN env var. * Add PredictionSupervisor for lifecycle management PredictionState snapshot for API responses with to_response() builder. PredictionHandle for waiting, state queries, and cancellation. SyncPredictionGuard for cancel-on-drop behavior in sync predictions. DashMap provides lock-free concurrent access - no deadlock risks. Terminal status triggers webhook send via spawned task. * Add Orchestrator for worker subprocess lifecycle * Add PredictionService for transport-agnostic prediction lifecycle * Add HTTP transport layer with axum server and routes * Add worker subprocess types for PyO3 bindings * Add coglet-python crate with PyO3 bindings * Add Python input/output processing infrastructure - input.rs: Runtime detection (Pydantic vs Coglet), URLPath download with parallel ThreadPoolExecutor, PreparedInput RAII for temp file cleanup - output.rs: Output processing using cog.json.make_encodeable() and upload_files() for base64 data URL conversion - audit.rs: TeeWriter for protecting sys.stdout/stderr with audit hooks that wrap user replacements while preserving slot routing - log_writer.rs: Full SlotLogWriter with ContextVar-based prediction routing, line buffering, setup log sender for health-check logs - cancel.rs: SIGUSR1 signal handling with CancelationException, CancelableGuard RAII for sync prediction cancellation * Add full predictor and worker bridge implementation - predictor.rs: Complete PythonPredictor with runtime detection, async/sync support, generator handling, input preparation, output processing, and schema generation - worker_bridge.rs: Full PredictHandler with asyncio event loop, per-slot cancellation tracking, log context management, SIGUSR1 signal cancellation for sync, future.cancel() for async * Add type stubs and Python tests - coglet.pyi: Type stubs for active(), serve(), _run_worker(), _is_cancelable() - tests/test_coglet.py: Integration tests for sync/async/generator predictors, health check, cancellation endpoints * Add setup log hook and audit helper exports to coglet module - Setup log hook for routing setup logs via SlotLogWriter - Export audit helpers (_is_slot_log_writer, _is_tee_writer, etc.) - Export SlotLogWriter and TeeWriter classes for isinstance checks * Upgrade PyO3 from 0.24 to 0.27 - Use abi3-py38 feature for stable ABI compatibility - Replace Python::with_gil with Python::attach (0.27 API) - Replace py.allow_threads with py.detach (0.27 API) - Add auto-initialize feature for standalone testing * Install log writers and audit hook in worker startup Add missing install_slot_log_writers() and install_audit_hook() calls in _run_worker() to enable stdout/stderr routing and protection. Remove module-level #![allow(dead_code)] since code is now integrated. * Add SetupError type for typed setup failures Replace Result<(), String> in PredictHandler::setup() with typed SetupError enum. Variants distinguish load vs setup vs internal errors, enabling future error code mapping without changing behavior. * Add architecture documentation for coglet crates - crates/README.md: Overall architecture, prediction flow, startup sequence, bridge protocol, directory structure - crates/coglet/README.md: Detailed coglet internals, components, health states, behaviors (shutdown, cancellation, slot poisoning) - crates/coglet-python/README.md: Python bindings details, active() flag, single async loop, stdout/stderr routing, audit hook, cancel * Remove dead predict_fn code path from service Production always uses orchestrator. The predict_fn/async_predict_fn path was only used in tests and never ran in production. - Remove PredictFn, AsyncPredictFn, PredictFuture types - Remove predict_via_function method - Remove legacy_pool field - Remove PredictionService::new(pool) constructor - Enforce READY requires orchestrator (silently ignores otherwise) - Update tests to not create impossible states * Extract Orchestrator trait for testable service layer Introduce Orchestrator trait with register_prediction() method: - Enables mocking orchestrator in service unit tests - OrchestratorHandle implements trait, keeps cancel/shutdown as inherent methods - Service now takes Arc<dyn Orchestrator> instead of Arc<OrchestratorHandle> - Export trait from coglet crate for downstream use * Add comprehensive tests using MockOrchestrator - Add MockOrchestrator in both service and routes test modules - Service tests: orchestrator integration, prediction flow, capacity - HTTP routes tests: full request/response cycle with mock predictor - Fix race in predict() by checking terminal before waiting on notify - 15 new tests covering prediction lifecycle end-to-end * Remove panics from permit/slot.rs state transitions - into_idle() now returns Option<IdleToken> instead of panicking - into_poisoned() now returns bool instead of panicking - Invalid state transitions log errors and use debug_assert for dev builds - Callers updated to handle new return types gracefully * Remove panics from orchestrator.rs - stdin/stdout take() now returns OrchestratorError instead of panicking - Prediction mutex locks use try_lock_prediction() helper that: - On poison: fails the prediction, logs error, returns None - Caller removes poisoned predictions from tracking - Further data from poisoned predictions is ignored * Remove panics from service.rs prediction locks - Add try_lock_prediction() helper that fails prediction on mutex poison - predict() returns PredictionError on mutex poison instead of panicking * Remove panics from coglet-python setup and mutex locks - worker_bridge: init_async_loop() and PythonPredictHandler::new() return Result - input: detect_runtime() returns Result, PreparedInput uses Py<PyDict> - log_writer: mutex locks use expect() with clear messages - predictor: handle detect_runtime Result propagation Worker mutex poisoning now panics with clear message - orchestrator handles worker exit and fails predictions appropriately. * Remove panics from webhook, routes, and server modules - webhook: WebhookSender::new() returns Result for HTTP client creation - webhook: Mutex locks use into_inner() to recover from poison - routes: Document timestamp unwrap safety (can't fail after UNIX_EPOCH) - server: Improve signal handler expect messages for clarity Callers handle webhook creation failure by logging and continuing without webhooks, which is acceptable graceful degradation. * Replace remaining panics with PyErr in coglet-python - input.rs: get_item().unwrap() -> ok_or_else(PyKeyError) for missing keys - log_writer.rs: OnceLock.get().unwrap() -> ok_or_else(PyRuntimeError) Both are now proper Python exceptions rather than Rust panics. * Add Rust CI workflow and stub generation - Add .github/workflows/rust.yaml for Rust CI (fmt, clippy, nextest) - Add mise tasks: stubs:generate, stubs:check, stubs:typecheck - Add mise tasks: ci:rust, ci:coglet for local CI simulation - Add scripts/generate_stubs.py to auto-generate coglet.pyi from module - Update coglet.pyi with all module exports (classes + internal functions) CI runs: - Pure Rust checks (fmt, clippy, tests) without Python - coglet-python checks on Python 3.9/3.12/3.13 (build, stub check, typecheck) * Apply cargo fmt formatting * Fix Rust CI: add cargo-binstall, uv sync for Python - Install cargo-binstall before mise so mise uses binstall for cargo tools - Add uv sync to setup Python environment before maturin build - Order: Rust -> cargo-binstall -> rust-cache -> mise -> uv sync * Fix CI: resolve checkout and virtualenv issues - Add #[allow(dead_code)] on AbstractSocketTransport.prefix (kept for debugging) - Add 'uv venv' step before maturin build in CI - Move virtualenv PATH setup to individual steps after checkout to prevent "Unable to locate executable file: tar" error during actions/checkout@v4 * Replace boolean flags with Rust enums (#2643) * refactor: replace boolean flags with Rust enums Replace boolean soup with strongly-typed enums across four areas: - PredictorKind enum (Class/StandaloneFunction) replacing 5 boolean flags in predictor.rs (is_async, is_async_gen, has_train, is_train_async, is_standalone_function) - PredictionOutcome enum (Success/Failed/Cancelled) replacing success bool and error string matching in worker.rs - HandlerMode enum (Predict/Train) replacing is_train boolean in worker_bridge.rs - SlotState enum (Idle/SyncPrediction/AsyncPrediction) replacing 4 fields with state machine in worker_bridge.rs Benefits: - Invalid states impossible at compile time - Eliminates string matching for cancellation - Clearer semantics and self-documenting code - Easier to extend with new variants Addresses PR #2641 review feedback. * fix: apply clippy suggestions Use derive(Default) with #[default] attribute instead of manual impl. * style: apply rustfmt formatting * test: update tests to use PredictionOutcome enum Replace direct field access (.success, .error) with pattern matching on the PredictionOutcome enum. * chore: rm deadcode, simplify start_sync_prediction logic Signed-off-by: Mark Phelps <mphelps@cloudflare.com> --------- Signed-off-by: Mark Phelps <mphelps@cloudflare.com> * Add Rust coglet integration to CI tests (#2644) * cog-dataclass: fix Rust server predictor configuration Fix http.py to use config.get_predictor_ref() instead of non-existent config.predictor attribute. This properly loads the predictor reference from cog.yaml for the Rust coglet server. Changes: - Use config.get_predictor_ref(args.mode) to get predictor ref - Add helpful error handling with example if predictor not configured - Exit cleanly with error code 1 instead of crashing * Enable Rust coglet server in integration tests Wire up the Rust coglet server (coglet-python) for integration testing alongside both cog and cog-dataclass Python SDKs. CI changes: - Build Rust coglet wheel (ABI3) in build-python job - Update test-integration matrix to 5 combinations: * cog (Python server) * cog-dataclass (Python server) * coglet-alpha (old Go server) * cog-rust (cog + Rust server) * cog-dataclass-rust (cog-dataclass + Rust server) - Set COG_WHEEL and COGLET_RUST_WHEEL appropriately per runtime Integration test harness: - Propagate COGLET_RUST_WHEEL environment variable to tests - Add conditions: cog_rust, cog_dataclass_rust, coglet_rust - Allow tests to conditionally run based on server implementation Dockerfile generation: - Prevent installing Rust coglet when using coglet-alpha - Add clear warning when both are specified - Do not install Rust coglet for monobase/fast builds Rust coglet: - Add SdkImplementation enum (Pydantic, Dataclass, Unknown) - Detect SDK type by checking for PYDANTIC_V2 attribute - Log SDK implementation at startup for visibility This allows us to test the Rust server with both SDK implementations and ensure compatibility across the matrix. * Clarify SDK detection: PYDANTIC_V2 exists in both v1 and v2 Add comment explaining that PYDANTIC_V2 attribute exists in both pydantic v1 (set to False) and pydantic v2 (set to True), but does not exist at all in dataclass-based cog. This means the presence of the attribute (regardless of value) indicates pydantic-based cog, while absence indicates dataclass-based. * Exclude Rust coglet wheel from Go binary embedding Only embed the Python coglet wheel (py3-none-any) in the Go binary. The Rust coglet wheel (cp38-abi3-*) is distributed separately and installed via COGLET_RUST_WHEEL environment variable at runtime. This prevents the "expected exactly 1 coglet wheel embedded, found 2" panic when both wheels are present in dist/. * Fix type checking errors in cog http.py for Rust coglet integration - Replace config.predictor with config.get_predictor_ref(args.mode) - Add proper error handling for missing predictor configuration - Add type: ignore comments for coglet attributes (conditional import) This matches the fixes already applied to cog-dataclass and resolves the pyright type checking failures in CI lint job. * Remove embedded Python coglet wheel to avoid package name conflicts The Python coglet wheel (py3-none-any) conflicts with the Rust coglet wheel (cp38-abi3) as both have package name "coglet". Removed Python coglet wheel completely: - Remove Python coglet wheel build from CI - Remove coglet wheel from Go binary embedding (only cog + cog-dataclass) - Remove installEmbeddedCogletWheel() function - Remove ReadCogletWheel() function - Remove WheelSourceCogletEmbedded switch case (falls through to error) Users should now use: - COG_WHEEL=cog or cog-dataclass (Python servers) - COG_WHEEL=coglet-alpha (Go implementation, downloaded from URL) - COGLET_RUST_WHEEL=path (Rust server, installed alongside cog/cog-dataclass) If someone sets COG_WHEEL=coglet, they'll get "unknown wheel source: coglet" error instead of a panic. * Remove coglet wheel test references - Delete TestReadCogletWheel (function no longer exists) - Delete TestCOGWheelEnvCoglet (embedded coglet removed) - Delete TestCOGWheelCogletWithPython39Succeeds (embedded coglet removed) - Remove WheelSourceCogletEmbedded from TestWheelSourceString - Remove "coglet keyword" and "coglet uppercase" test cases - Remove "env coglet overrides" test case Tests now only cover: cog, cog-dataclass, coglet-alpha, file paths, and URLs. * Fix Path handling for data URLs in cog-dataclass - Use Path.validate() instead of Path() constructor to properly convert data URLs to URLPath objects - Fix PredictionRequest.dict() to avoid asdict() serialization issues with URLPath objects This fixes path_input and path_list_input integration tests where data URLs were being passed as raw strings instead of being converted to file paths. * cog-dataclass: allow non-BasePredictor classes and add tests - Accept predictor classes without BasePredictor inheritance - Keep setup optional; validate if present - Add unit tests for permissive create_predictor behavior - Add integration tests for class/function predictors (skip coglet_alpha) * coglet: detect non-pydantic runtime via cog-dataclass inspector - Detect cog-dataclass predictors via cog._inspector.create_predictor - Use non-pydantic input processor with URLPath downloads - Generate schemas via cog-dataclass _schemas for non-pydantic runtime - Remove legacy coglet runtime detection path * cog-dataclass: accept local paths in Path.validate Keep URL inputs as URLPath while restoring local path normalization. * coglet: fix spurious error log during normal shutdown Send ControlRequest::Shutdown to workers before exiting to avoid "Control channel closed (parent died?)" error message during normal shutdown. Changes: - Add shutdown() to Orchestrator trait - Implement shutdown() in OrchestratorHandle trait impl - Add PredictionService.shutdown() to call orchestrator.shutdown() - Call service.shutdown() in HTTP server after graceful shutdown - Update MockOrchestrator test doubles with no-op shutdown() Before: Workers saw channel close and logged spurious error After: Workers receive shutdown message and exit cleanly * coglet: use manylinux2014 for Python 3.8+ compatibility Change from auto-detected manylinux_2_39 to manylinux2014 (glibc 2.17) to ensure the coglet wheel installs on Python 3.8 base images. * Linting fixes * Use rustls with native OS certificate stores Configure both HTTP clients to use operating system certificate stores: - reqwest: Use rustls-tls-native-roots feature (rustls-native-certs) - ureq: Use rustls + platform-verifier with RootCerts::PlatformVerifier This allows custom CA certificates to be supplied via the OS certificate store (e.g., /etc/ssl/certs) instead of being hardcoded at compile time. Trade-offs: - Requires ca-certificates package in base images - Explicit failure if certs missing (better than stale bundled certs) - No diamond dependency issues (avoided native-tls) - No unstable features (avoided rustls-no-provider) Note: webpki-roots remains in dependency tree via ureq's rustls feature but is not used at runtime due to explicit PlatformVerifier configuration. * Allow CDLA-Permissive-2.0 license in cargo-deny Add Community Data License Agreement Permissive 2.0 to the allow list for webpki-roots certificate data. This is a permissive license compatible with Apache-2.0, used for Mozilla's CA certificate bundle. * Fix coglet-python test task to install dependencies Use mise dir field and uv --with flags to install test dependencies (pytest, requests) before running tests. * CI: Add coglet-python tests and ensure rust+binstall before mise - Add test-coglet-python job to run Python binding tests - Install Rust toolchain and cargo-binstall before mise-action in all jobs to avoid compiling cargo tools from source (uses binstall instead) - Ensures correct order: rust -> binstall -> mise -> rust-cache * Fix coglet-python tests with dynamic port discovery and cog.yaml - Add cog.yaml files to test fixtures (required by both cog and cog-dataclass) - Implement dynamic port assignment using port 0 for tests - Add background thread to capture stderr logs for port discovery - Log actual bound port after TcpListener::bind() in coglet server - Update mise task to install dependencies in proper venv - All 12 coglet-python tests now passing * Require Python 3.10+ for coglet (matching cog-dataclass) - Update coglet-python requires-python from >=3.8 to >=3.10 - Change PyO3 ABI3 target from py38 to py310 - Update CI wheel paths from cp38-abi3 to cp310-abi3 - Matches cog-dataclass minimum Python version requirement * Skip integration tests with Python <3.10 or monobase for coglet (Rust) Add [coglet_rust] skip to integration tests that coglet cannot run: Python <3.10 (9 tests): - python37_deprecated.txtar (3.7) - optional_input.txtar (3.8) - setup_subprocess_multiprocessing.txtar (3.8) - apt_packages.txtar (3.9) - bad_dockerignore.txtar (3.9) - torch_baseimage_fallback.txtar (3.9) - torch_baseimage_no_cog_base.txtar (3.9) - torch_baseimage_precompile.txtar (3.9) - torch_cuda_baseimage.txtar (3.9) Monobase/fast build (7 tests): - build_base_image_sha.txtar - build_cog_version_match.txtar - build_localimage.txtar - build_python313_base_image.txtar - predict_localimage.txtar - fast_build.txtar - run_fast_build.txtar coglet (Rust) requires Python >=3.10 and does not support monobase. * CI: Use zig cross-compilation for Linux coglet wheel build Use build:coglet:wheel:linux instead of build:coglet:wheel to ensure manylinux2014 (GLIBC 2.17) compatibility. Building on modern Linux systems picks up too-recent GLIBC symbols (2.18-2.39) which breaks manylinux compliance. Zig cross-compilation targets the correct GLIBC version. * fixup! CI: Add coglet-python tests and ensure rust+binstall before mise Use astral-sh/setup-uv for proper venv management * Fix Secret serialization and skip pydantic2_output for cog-dataclass - Fix Secret.json_encode() to unwrap Secret objects before JSON serialization This fixes 'TypeError: Object of type Secret is not JSON serializable' when generating OpenAPI schemas with Secret input fields that have defaults - Skip pydantic2_output test for cog-dataclass/coglet_rust since it explicitly tests Pydantic 2 BaseModel features * Fix OpenAPI schema generation for training mode in coglet - Make cog-dataclass to_json_schema() mode-aware: - Accept Mode parameter (PREDICT or TRAIN) - Generate appropriate routes (/predictions or /trainings) - Use correct schema names (PredictionRequest/Response or TrainingRequest/Response) - Update coglet-python to pass HandlerMode through schema generation: - Pass mode from worker_bridge to predictor.schema() - Convert Rust HandlerMode enum to Python Mode enum - Pass mode to both pydantic (schema_via_fastapi) and cog-dataclass (to_json_schema) This fixes the "missing output definition for /trainings" error when using coglet with training endpoints, ensuring training has its own distinct Input/Output schemas based on the train() method signature. * Fix validation error messages to match pydantic format - Change 'missing required input field' to 'Field required' in cog-dataclass - Revert string_predictor.txtar test to expect 's: Field required' - Fix mise.toml to use 'python -m pytest' to avoid hardcoded shebang issues This ensures cog-dataclass validation errors match pydantic's format, making the integration test pass. * Fix subprocess output corrupting control channel in coglet When Python's setup() spawns subprocesses (subprocess.Popen), they inherit fd 1 (stdout) which is the control channel pipe to the orchestrator. Subprocess output writes directly into the pipe, corrupting JSON framing with raw text. This caused "frame size too big" errors when the bash subprocess in setup_subprocess_simple.txtar wrote 100 lines during setup. Solution: File descriptor redirection early in worker startup - Move control channel to high-numbered fds (99-101) - Replace fd 1/2 with capture pipes - Spawn threads to read pipes and route output through log system - Subprocess logs flow to coglet::user target (like orphan logs) - Log forwarder runs for entire worker lifetime Changes: - Add crates/coglet/src/fd_redirect.rs with fd manipulation - Update worker.rs to use redirected fds for control channel - Add RUST_LOG propagation in CLI commands (serve, predict, run) - Add RUST_LOG support in integration test harness - Add frame size and log forwarding trace logging for debugging - Route control channel logs to coglet::user target Safety contracts documented for all unsafe blocks. Passes clippy with no warnings. Fixes setup_subprocess_simple integration test. * Fix validation error format to show simple field messages Standardize validation error messages across cog-dataclass and coglet: - cog-dataclass: Change "missing required input field" to "Field required" - coglet: Parse Pydantic ValidationError to extract "field: message" format - Both now output simple "field: message" instead of verbose error details Fixes integration test string_predictor validation error matching. * Fix TrainingOutput serialization for Pydantic v2 models make_encodeable() now checks for model_dump() (Pydantic v2) before falling back to dict() (Pydantic v1). This fixes train() functions that return Pydantic BaseModel instances like TrainingOutput. Fixes train_basic integration test. * Fix validation error formatting in coglet to show simple messages Parse Pydantic ValidationError to extract simple "field: message" format instead of verbose error details. For ValueError (cog-dataclass), the message is already in the correct format. Includes clippy fixes for collapsible if statements. Fixes string_predictor integration test validation error matching. * Fix capture thread feedback loop and update to bounded channels Capture thread improvements: - Remove all tracing calls from capture threads to prevent feedback loop (stderr is redirected to capture pipe, so tracing would capture itself) - Update comments to reflect bounded channel (500 messages) instead of unbounded - Align safety comment with header documentation about tokio runtime threads Log writer updates: - Change SetupLogSender to use bounded Sender instead of UnboundedSender - Use blocking_send() for backpressure handling * Update cog-dataclass test to expect new validation error format Change test to expect "Field required" instead of "missing required input field" to match the updated error message format. * Fix blocking panic and add dropped log tracking Critical fixes: 1. Replace blocking_send() with try_send() in ControlChannelLogSender to prevent "Cannot block the current thread" panic when called from Python code on tokio runtime threads 2. Add DroppedLogs control message and periodic reporting (every 5s) to track logs dropped due to backpressure 3. Rename SetupLogSender -> ControlChannelLogSender with accurate documentation: - Lives for entire worker lifetime, not just setup - Handles both Python setup logs and subprocess output - Cleanup is no-op (sender persists) 4. Log dropped logs as warnings in orchestrator with human-readable intervals Fixes integration test panics during setup phase. * Ship worker tracing logs over IPC with preserved targets Add custom tracing layer for worker subprocess that ships structured tracing events over IPC to orchestrator, preserving component targets. Changes: - Add WorkerLog control message with target/level/message fields - Implement WorkerTracingLayer that intercepts tracing events before formatting - Initialize custom layer in worker with IPC channel sender - Orchestrator re-emits logs with original target using low-level tracing API - Optional fd 101 direct logging via RUST_WORKER_DIRECT_LOG=1 for debugging - Increased channel depth 500 → 5000 for trace-level logging support This replaces the previous approach where tracing went through capture pipes with ANSI codes, losing component information and appearing as generic 'coglet::setup' source. Note: Setuplog accumulation not yet implemented - logs are shipped and re-emitted but not added to SetupResult. Will be addressed in follow-up. * Prevent feedback loop in codec logging Filter out coglet::bridge::codec logs from IPC shipping to prevent infinite loop where encoding a WorkerLog message triggers codec trace logging, which creates another WorkerLog, which triggers more encoding, etc. Changes: - WorkerTracingLayer skips shipping logs with codec target - Added SAFETY comment in codec.rs explaining the feedback loop risk - Codec logs still visible via RUST_WORKER_DIRECT_LOG=1 for debugging * Accumulate coglet server setup logs via tracing layer Adds SetupLogAccumulator tracing layer that captures all logs from the coglet server process during the setup phase, from initial startup through worker Ready. Key implementation details: - Tracing layer installed before any logs emitted (captures "coglet <version>" log) - Accumulates orchestrator logs + re-emitted worker logs shipped over IPC - Critical ordering: accumulate → Ready received → drain → flip Ready state - Performance optimization: tx.is_closed() check prevents expensive ops after Ready - Channel semantics: unbounded for setup phase, drainable even after close - Log format: lines joined with \n separator, trailing newline at end The accumulated logs are returned in SetupResult.logs and exposed via /health-check endpoint for debugging setup failures. * Ensure 'Server ready' log is accumulated before stopping Keep the setup log receiver alive through the 'Server ready' log, then do a final drain to capture it before setting the setup result. The 'Setup complete, now accepting requests' log happens after the receiver is dropped, so it correctly does NOT appear in setup logs. Test verifies both behaviors: - 'Server ready' IS in setup logs - 'Setup complete' is NOT in setup logs * Stabilize coglet bindings runtime detection Split coglet bindings CI runs by runtime and align predictor ref parsing with cog loader semantics. - Parse predictor refs to use file stem instead of full dotted path (e.g., 'path/to/predict.py:Predictor' -> module='predict') - Split CI matrix to test both cog-dataclass and cog runtimes separately - Add mise tasks for testing each runtime independently - Ensures proper module import behavior matching Python's loader * Normalize coglet output serialization Handle dataclass BaseModel outputs and unwrap pydantic iterators before JSON encoding, plus rust-only setup log test guard. - Add dataclass conversion in make_encodeable() for BaseModel outputs - Unwrap pydantic serialization iterators in input processing and output serialization - Guard setup_worker_tracing_logs.txtar test for rust-only coglet - Add type safety check for numpy isinstance calls --------- Signed-off-by: Mark Phelps <mphelps@cloudflare.com> Co-authored-by: Mark Phelps <209477+markphelps@users.noreply.github.com>
Summary
Addresses review feedback from #2641 by replacing boolean "soup" with strongly-typed Rust enums across four key areas of the coglet codebase.
Changes
1. Predictor Method Characteristics (
predictor.rs)Before: 5 boolean fields
is_async,is_async_gen,has_train,is_train_async,is_standalone_functionAfter:
PredictorKindenum with nested structureBenefits: Invalid combinations like
is_async=false+is_async_gen=trueare now impossible at compile time.2. Prediction Results (
worker.rs)Before:
success: bool+error: Option<String>with string matchingAfter:
PredictionOutcomeenumBenefits: Type-safe pattern matching, no string comparisons for cancellation detection.
3. Worker Bridge Mode (
worker_bridge.rs)Before:
is_train: boolAfter:
HandlerModeenumBenefits: Explicit intent, clearer than boolean flag.
4. Slot State (
worker_bridge.rs)Before: 4 fields tracking state
cancelled: bool,in_progress: bool,is_async: bool,async_future: Option<Py<PyAny>>After:
SlotStateenum with proper state machineBenefits: State machine clarity, cancellation context preserved for choosing the right cancel mechanism (SIGUSR1 vs future.cancel()).
Impact
Testing
cargo check --all)Stats