Skip to content

fix: address review issues in static schema generation#2805

Open
markphelps wants to merge 15 commits intofix/dict-list-output-schemafrom
mp/fix/dict-list-output-schema-fixes
Open

fix: address review issues in static schema generation#2805
markphelps wants to merge 15 commits intofix/dict-list-output-schemafrom
mp/fix/dict-list-output-schema-fixes

Conversation

@markphelps
Copy link
Contributor

@markphelps markphelps commented Mar 4, 2026

Summary

Addresses remaining issues from thefix/dict-list-output-schema branch.

Bug fixes

  • Cycle detection for recursive BaseModelsResolveSchemaType now tracks a seen set to prevent infinite recursion on self-referential models like class TreeNode(BaseModel): child: Optional["TreeNode"]. Cycles emit an opaque {"type": "object"} instead of stack overflowing.
  • Forward reference handling in model fieldsparseTypeFromString now strips quotes from forward references like "MyType" or 'MyType' inside generic annotations (e.g. Optional["TreeNode"]). Previously these fields were silently dropped.
  • Deterministic JSON property orderSchemaObject.coreSchema() now uses orderedMapAny instead of map[string]any for the properties field, ensuring consistent JSON output.

Code cleanup

  • Explicit panic for unknown SchemaTypeKind — replaced silent fallthrough returning {"type": "object"} with a panic that surfaces missing switch cases immediately.
  • Removed redundant nullableSchemaObject field loop no longer re-sets nullable since jsonSchema() already handles it.
  • Removed unused KeyType field — dict key types are always strings in JSON Schema; the field was set but never read.

Test coverage (8 new tests)

  • TestForwardReferenceOutput / TestForwardReferenceGeneric — quoted string annotations
  • TestNestedBaseModelFieldclass Outer(BaseModel): inner: Inner composition
  • TestListOfBaseModelOutput / TestIteratorOfBaseModelOutput — BaseModel in generic outputs
  • TestRecursiveBaseModelDoesNotStackOverflow — self-referential model cycle detection
  • TestOutputObjectPropertyOrderDeterministic — JSON property ordering

Not addressed (documented, behind opt-in gate)

  • Cross-file depth-1 resolution limit
  • No __init__.py package import support
  • Fuzz tests don't exercise BaseModel/cross-file paths

tempusfrangit and others added 14 commits February 27, 2026 16:40
The tree-sitter schema parser rejected unparameterized dict and list
return types with 'unsupported type'. This broke the resnet example
and any predictor returning -> dict or -> list.

- dict/Dict map to TypeAny (JSON Schema: {"type": "object"})
- list/List map to OutputList with TypeAny (JSON Schema: {"type": "array", "items": {"type": "object"}})

This matches the old Python schema gen behavior exactly:
  dict -> Dict[str, Any] -> {"type": "object"}
  list -> List[Any] -> {"type": "array", "items": {"type": "object"}}
…nsive tests

Replace the flat OutputType system with a recursive SchemaType algebraic data
type that supports arbitrary nesting (dict[str, list[dict[str, int]]], etc.).

Key changes:
- SchemaType ADT with 7 kinds: Primitive, Any, Array, Dict, Object, Iterator, ConcatIterator
- ResolveSchemaType: recursive resolver replacing ResolveOutputType
- Cross-file model resolution: imports from local .py files are found on disk,
  parsed with tree-sitter, and BaseModel subclasses extracted automatically
- Handles all local import permutations: relative, dotted, subpackage, aliased
- Clear error messages for unresolvable types (includes import source and guidance)
- Remove legacy OutputType, OutputKind, ObjectField, ResolveOutputType
- Thread sourceDir through Parser -> ParsePredictor for filesystem access
- Rewrite architecture/02-schema.md for the static Go parser

Tests: 93 unit tests (12 recursive nesting, 5 unresolvable errors, 3 pydantic
compat, 11 cross-file resolution, 1 end-to-end schema gen) + 1 integration test
…ursive model fields

- Propagate errors from dict value type resolution instead of silently
  falling back to opaque SchemaAny (dict[str, Tensor] now errors)
- Extract UnwrapOptional helper used by ResolveFieldType,
  resolveUnionSchemaType, and resolveFieldSchemaType (3 callsites)
- resolveModelToSchemaType now uses ResolveSchemaType via
  resolveFieldSchemaType, supporting dict/nested types inside BaseModel
  fields (previously limited to primitives, Optional[T], List[T])
- Fix stale comments: Optional rejected not nullable, Iterator allows
  nested types
- Fix garbled unicode in architecture docs, fix SchemaAny table entry
- Fix nullable incorrectly set on non-required fields instead of
  field.Type.Nullable (debug: bool = False was appearing nullable)
- Remove dead CogArrayType/CogArrayDisplay struct fields (hardcoded
  in coreSchema, never read from struct)
- Distinguish os.ErrNotExist from permission errors in cross-file
  resolution; warn on parse failures instead of silently ignoring
- Fix bare dict/list JSON Schema examples in architecture docs
- Add regression tests: defaulted non-Optional field not nullable,
  Optional field nullable in JSON Schema, dict[str, Tensor] errors
  instead of silently producing SchemaAny (both top-level and inside
  BaseModel fields)
Four fuzz targets exercising the schema pipeline:
- FuzzResolveSchemaType: arbitrary TypeAnnotation trees through the
  recursive resolver
- FuzzJSONSchema: random SchemaType trees through JSON Schema rendering
- FuzzParsePredictor: arbitrary bytes as Python source through the
  tree-sitter parser
- FuzzParseTypeAnnotation: arbitrary return type strings in predict
  signatures

Includes:
- mise task 'test:fuzz' (FUZZTIME=30s per target by default)
- CI job 'fuzz-go' running 30s per target on Go changes
- Byte encoder/decoder for deterministic TypeAnnotation and SchemaType
  tree construction from fuzz corpus
* fix: make static schema generation opt-in via COG_STATIC_SCHEMA env var

The static Go tree-sitter schema generator was the default for SDK >= 0.17.0,
which risks breaking builds when the parser encounters types it cannot resolve.

- Gate static schema gen behind COG_STATIC_SCHEMA=1 (or "true") env var
- Legacy runtime schema generation (boot container + python introspection)
  remains the default
- When opted in, gracefully fall back to legacy on ErrUnresolvableType
  instead of hard-failing the build
- Add unit tests for canUseStaticSchemaGen (12 table-driven cases)
- Add integration test for the static->legacy fallback path
- Update existing static/multi-file integration tests to set the env var

* docs: update schema architecture doc for opt-in static generation model

Reflect that static schema generation is opt-in via COG_STATIC_SCHEMA=1,
with legacy runtime path as the default and automatic fallback.

* fix: restore legacy runtime schema generation modules for fallback path

Restore the Python modules needed by the legacy runtime schema generation
path (python -m cog.command.openapi_schema). These were deleted in 61eedf3
but are needed as the fallback when the static Go parser encounters types
it cannot resolve.

Restored modules: _adt, _inspector, _schemas, coder, config, errors, mode,
suppress_output, command/__init__, command/openapi_schema.

config.py is trimmed to only what openapi_schema.py needs (removed
get_predictor_types which depended on deleted get_predict/get_train).

* fix: simplify schema gen gating — let coglet handle missing schema for train/serve

Remove the skipLabels override that forced static schema gen for
cog train/serve paths. Now useStatic is purely opt-in via
COG_STATIC_SCHEMA=1 for all commands.

For train/serve without the env var, no schema is generated at
build time. Coglet gracefully handles this (warns and accepts all
input). These are local-only images that don't need strict schema
validation.

Also improves the static_schema_fallback integration test to use
a realistic mypackage/__init__.py scenario instead of a plain class.

* fix: restore static gen for train/serve paths, handle pydantic v2 in legacy inspector

Two IT failures:

1. training_setup: cog train needs schema for -i flag parsing. The CLI
   fetches it from coglet's /openapi.json, which returns 503 when no
   schema file exists. Re-enable static gen for skipLabels paths (same
   as main) since there's no post-build legacy fallback for these.

2. pydantic2_output: the legacy runtime inspector (_inspector.py) didn't
   handle pydantic v2 BaseModel as output types — only cog.BaseModel.
   Add conditional pydantic.BaseModel check with model_fields iteration.

* fix: suppress pyright reportMissingImports for optional pydantic import

* fix: flatten nested error handling to early returns, case-insensitive env var check
Add cycle detection for recursive BaseModel types to prevent stack
overflow, use ordered maps for deterministic JSON property output,
fix forward reference handling in model field type annotations, and
close test coverage gaps for nested BaseModels, list/Iterator of
BaseModel, and property ordering.
@markphelps markphelps requested a review from a team as a code owner March 4, 2026 21:14
// it indicates a missing case (e.g. a new kind was added without updating
// this switch). Panic to surface the bug immediately rather than silently
// returning a wrong schema.
panic(fmt.Sprintf("unhandled SchemaTypeKind: %d", s.Kind))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

controversial

…TypeFromString

Fuzz testing found that a bare quote character (e.g. `"`) matched the
forward reference check but caused s[1:0] slice panic. Add len >= 2
guard before slicing.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses remaining review items in static JSON schema generation for Python predictors, focusing on BaseModel recursion safety, forward-reference parsing, and deterministic JSON output.

Changes:

  • Add cycle detection when resolving recursive Pydantic BaseModel output schemas to prevent infinite recursion.
  • Improve parsing of quoted forward references inside string-based annotations.
  • Make JSON Schema object properties ordering deterministic by using an ordered map type, and add/adjust tests accordingly.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
pkg/schema/schema_type.go Removes unused dict key typing, adds ordered properties output, adds cycle detection in model resolution, and panics on unhandled SchemaTypeKind.
pkg/schema/python/parser.go Adds quote-stripping forward-reference handling in parseTypeFromString.
pkg/schema/python/parser_test.go Adds new tests for forward refs, nested BaseModels, BaseModel generics, recursion safety, and deterministic property order; adjusts existing nullable assertions via JSON round-trip.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +688 to +692
// Forward reference: quoted string like "MyType" or 'MyType'
if len(s) >= 2 &&
((strings.HasPrefix(s, "\"") && strings.HasSuffix(s, "\"")) ||
(strings.HasPrefix(s, "'") && strings.HasSuffix(s, "'"))) {
inner := s[1 : len(s)-1]
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The forward-reference quote stripping happens after the | union parsing. A quoted union like "TreeNode | None" will be split on | before quotes are removed, causing parsing to fail and the annotation to be dropped. Consider stripping surrounding quotes immediately after TrimSpace (before union/generic handling) so quoted unions/generics are handled correctly.

Copilot uses AI. Check for mistakes.
betaIdx := indexOf(jsonStr, `"beta"`)
gammaIdx := indexOf(jsonStr, `"gamma"`)
deltaIdx := indexOf(jsonStr, `"delta"`)

Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test compares ordering via substring indices but doesn't assert that each key was actually found (index != -1). If a key is missing, the failure can be a confusing ordering assertion (e.g. -1 > -1). Add explicit assertions that each index is >= 0 before comparing their relative order.

Suggested change
require.GreaterOrEqual(t, alphaIdx, 0, `"alpha" property should be present in JSON schema`)
require.GreaterOrEqual(t, betaIdx, 0, `"beta" property should be present in JSON schema`)
require.GreaterOrEqual(t, gammaIdx, 0, `"gamma" property should be present in JSON schema`)
require.GreaterOrEqual(t, deltaIdx, 0, `"delta" property should be present in JSON schema`)

Copilot uses AI. Check for mistakes.
@michaeldwan michaeldwan force-pushed the fix/dict-list-output-schema branch from 1e78590 to 575454e Compare March 5, 2026 17:53
@markphelps markphelps added this to the 0.18.0 milestone Mar 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants