[FEA] GFQL native type system: schemas, inference, validation, and Arrow representation #1046

@lmeyerov

Summary

Add a native type system to GFQL covering schema specification, inference, query validation, and typed data representation (Arrow). This is a foundational capability that would improve correctness, performance, and developer experience across the GFQL stack.

Motivation

  • Correctness: Catch schema mismatches at query compile time instead of at runtime
  • Performance: Arrow-typed columns enable zero-copy GPU transfer and columnar optimization
  • Developer experience: Autocompletion, documentation, and error messages that reference schema
  • Interop: Typed schemas enable code generation for downstream consumers (TypeScript, Rust, etc.)

Scope

1. Schema specification

Define node and edge schemas declaratively:

from datetime import datetime
from typing import Optional

# Proposed module; it does not exist yet
from graphistry.schema import NodeType, EdgeType, GraphSchema

Person = NodeType("Person", {
    "id": int,
    "name": str,
    "age": Optional[int],
    "scores": list[float],
})

Company = NodeType("Company", {
    "id": int,
    "name": str,
    "founded": datetime,
})

WorksAt = EdgeType("WORKS_AT",
    source=Person,
    destination=Company,
    properties={
        "since": datetime,
        "role": str,
    },
)

schema = GraphSchema(
    node_types=[Person, Company],
    edge_types=[WorksAt],
)

Design considerations:

  • Multi-label: Cypher nodes can have multiple labels (`:Person:Employee`). Schema should support this — a node satisfies a type if it has ALL required labels.
  • Topology constraints: Edge types should declare valid (source_type, destination_type) pairs. `WORKS_AT` can only connect Person → Company.
  • Union types: A property might be `str | int` (heterogeneous data). Support via Python union syntax or explicit `Union[str, int]`.
  • Optional fields: Properties that may be null/missing. Align with `Optional[T]` / `T | None`.
  • Pydantic alignment: Consider using or extending Pydantic models for schema definition, getting validation, serialization, and IDE support for free.
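The multi-label and topology rules above can be made concrete with a small sketch. This is only an illustration under stated assumptions: `NodeTypeSketch`, `EdgeTypeSketch`, `required_labels`, `endpoints`, `matches`, and `allows` are hypothetical names, not part of any existing graphistry API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet, Set, Tuple


@dataclass
class NodeTypeSketch:
    """Hypothetical stand-in for the proposed NodeType."""
    name: str
    required_labels: FrozenSet[str]
    properties: Dict[str, Any] = field(default_factory=dict)

    def matches(self, labels: Set[str]) -> bool:
        # A node satisfies the type only if it carries ALL required labels.
        return self.required_labels <= labels


@dataclass
class EdgeTypeSketch:
    """Hypothetical stand-in for the proposed EdgeType."""
    name: str
    # Allowed (source type name, destination type name) pairs.
    endpoints: Set[Tuple[str, str]]

    def allows(self, source: NodeTypeSketch, destination: NodeTypeSketch) -> bool:
        return (source.name, destination.name) in self.endpoints


person = NodeTypeSketch("Person", frozenset({"Person"}))
employee = NodeTypeSketch("Employee", frozenset({"Person", "Employee"}))
company = NodeTypeSketch("Company", frozenset({"Company"}))
works_at = EdgeTypeSketch("WORKS_AT", {("Person", "Company")})

print(employee.matches({"Person", "Employee", "Manager"}))  # True: extra labels are fine
print(employee.matches({"Person"}))                         # False: missing Employee
print(works_at.allows(person, company))                     # True
print(works_at.allows(person, person))                      # False
```

Storing endpoint pairs as a set leaves room for an edge type that legitimately connects several (source, destination) combinations.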

2. Schema inference

Infer schemas from existing graph data:

schema = g.infer_schema()
# Returns GraphSchema with node types derived from label__* columns,
# edge types from relationship type column, property types from DataFrame dtypes

Design considerations:

  • Infer from `label__X` boolean columns (existing GFQL convention)
  • Infer property types from pandas/cudf dtypes → Arrow types
  • Handle mixed-type columns (object dtype) gracefully
  • Detect topology patterns (which edge types connect which node types)
  • Support incremental refinement: infer base schema, then user annotates/overrides
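As a rough sketch of the first three points, inference can start from nothing more than column names and dtype strings, such as those produced by `df.dtypes.astype(str).to_dict()` on a pandas DataFrame. The function name `infer_node_columns`, the return shape, and the deliberately partial dtype table are assumptions made for illustration.

```python
from typing import Dict, List, Optional

LABEL_PREFIX = "label__"

# Partial, illustrative mapping from pandas dtype names to Arrow type aliases.
DTYPE_TO_ARROW = {
    "int64": "int64",
    "float64": "float64",
    "bool": "bool",
    "datetime64[ns]": "timestamp[ns]",
}


def infer_node_columns(dtypes: Dict[str, str]) -> Dict[str, object]:
    """Split columns into label flags and typed properties (hypothetical helper)."""
    labels: List[str] = []
    properties: Dict[str, Optional[str]] = {}
    for column, dtype in dtypes.items():
        if column.startswith(LABEL_PREFIX) and dtype == "bool":
            # label__X boolean columns encode node labels.
            labels.append(column[len(LABEL_PREFIX):])
        else:
            # None marks a column (e.g. object dtype) that needs value-level
            # inspection or a user override before it gets an Arrow type.
            properties[column] = DTYPE_TO_ARROW.get(dtype)
    return {"labels": sorted(labels), "properties": properties}


inferred = infer_node_columns({
    "id": "int64",
    "name": "object",
    "age": "float64",
    "label__Person": "bool",
    "label__Company": "bool",
})
print(inferred["labels"])      # ['Company', 'Person']
print(inferred["properties"])  # {'id': 'int64', 'name': None, 'age': 'float64'}
```

Returning `None` for unresolved columns rather than guessing keeps the inferred schema honest and gives the user an obvious place to annotate or override.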

3. Query validation against schema

Validate GFQL chains and Cypher queries against a schema:

schema = GraphSchema(...)
g = g.bind(schema=schema)

# Compile-time validation:
g.gfql("MATCH (p:Person)-[:WORKS_AT]->(c:Company) RETURN p.age, c.nonexistent")
# → SchemaValidationError: Company has no property 'nonexistent'

g.gfql("MATCH (p:Person)-[:WORKS_AT]->(q:Person) RETURN p, q")
# → SchemaValidationError: WORKS_AT edge type requires destination=Company, got Person

Design considerations:

  • Validate at Cypher compile time (in `lowering.py`), so the check runs once per query and adds no cost during execution
  • Validate native GFQL chains via schema-aware `n()` / `e()` constructors
  • Provide helpful error messages referencing the schema definition
  • Optional strict mode vs permissive mode (warn vs error on unknown properties)
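One way the strict and permissive modes might behave is sketched below. The exception name `SchemaValidationError` comes from the examples above, but its definition here, the `validate_property_refs` helper, and the flat `schema_props` dictionary are assumptions for illustration. A real implementation would walk the compiled query rather than take a list of references.

```python
import warnings
from typing import Dict, Iterable, List, Set, Tuple


class SchemaValidationError(Exception):
    """Name taken from the examples above; this definition is an assumption."""


def validate_property_refs(
    schema_props: Dict[str, Set[str]],
    refs: Iterable[Tuple[str, str]],
    strict: bool = True,
) -> List[str]:
    """Check (type name, property) references against known properties."""
    problems: List[str] = []
    for type_name, prop in refs:
        if type_name not in schema_props:
            problems.append(f"Unknown type '{type_name}'")
        elif prop not in schema_props[type_name]:
            problems.append(f"{type_name} has no property '{prop}'")
    if problems and strict:
        # Strict mode: fail before the query runs.
        raise SchemaValidationError("; ".join(problems))
    for problem in problems:
        # Permissive mode: report the problem but let the query proceed.
        warnings.warn(problem)
    return problems


schema_props = {
    "Person": {"id", "name", "age"},
    "Company": {"id", "name", "founded"},
}
refs = [("Person", "age"), ("Company", "nonexistent")]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    problems = validate_property_refs(schema_props, refs, strict=False)

print(problems)  # ["Company has no property 'nonexistent'"]
```

Collecting all problems before raising lets a single error message list every bad reference in the query instead of only the first.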

4. Arrow representation

Map schema types to Arrow types for efficient columnar storage:

schema.to_arrow_schema()
# Returns pyarrow.Schema with typed fields for each property

# Load/save with enforced types:
g = graphistry.from_arrow(nodes_table, edges_table, schema=schema)
g.to_arrow(schema=schema)  # Validates and casts to schema types

Design considerations:

  • Map Python types → Arrow types: `int → int64`, `str → utf8`, `list[float] → list<float64>`, `Optional[T] →` nullable field of `T`, etc.
  • Support cudf/RAPIDS Arrow interop
  • Enable zero-copy roundtrip: Arrow IPC → cudf → GFQL → Arrow IPC
  • Schema evolution: handle missing columns, extra columns, type coercion
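The type mapping in the first point could be prototyped without pyarrow by resolving annotations to Arrow type names plus a nullability flag. This is a sketch under assumptions: `to_arrow_name` and the `SCALARS` table are hypothetical, and `datetime → timestamp[us]` is one plausible choice rather than a decision made in this issue. The resulting names correspond to pyarrow factories such as `pa.int64()` and `pa.list_(pa.float64())`.

```python
from datetime import datetime
from typing import List, Optional, Tuple, Union, get_args, get_origin

# Illustrative scalar mapping; the timestamp unit is an assumption.
SCALARS = {
    int: "int64",
    float: "float64",
    str: "utf8",
    bool: "bool",
    datetime: "timestamp[us]",
}


def to_arrow_name(annotation) -> Tuple[str, bool]:
    """Return (Arrow type name, nullable) for a Python annotation."""
    origin = get_origin(annotation)
    if origin is Union:
        non_null = [a for a in get_args(annotation) if a is not type(None)]
        if len(non_null) != 1:
            # Heterogeneous unions such as Union[str, int] need an explicit
            # policy (cast, Arrow union type, ...) that this sketch leaves open.
            raise TypeError(f"No single Arrow type for {annotation}")
        name, _ = to_arrow_name(non_null[0])
        return name, True  # Optional[T] becomes a nullable field of T
    if origin is list:
        (item,) = get_args(annotation)
        item_name, _ = to_arrow_name(item)
        return f"list<{item_name}>", False
    if annotation in SCALARS:
        return SCALARS[annotation], False
    raise TypeError(f"Unsupported annotation: {annotation}")


person_props = {
    "id": int,
    "name": str,
    "age": Optional[int],
    "scores": List[float],
}
arrow_fields = {name: to_arrow_name(t) for name, t in person_props.items()}
print(arrow_fields["age"])     # ('int64', True)
print(arrow_fields["scores"])  # ('list<float64>', False)
```

Raising on multi-member unions instead of silently picking a type surfaces the union-type design question early and avoids hiding it behind a lossy cast.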

Architecture questions

  1. Where does schema live? On the Plottable? As a separate object? Both?
  2. GFQL-first or Cypher-first? If we start at GFQL (schema-aware `n()`/`e()`), Cypher gets validation for free via the existing lowering path. Starting at Cypher requires mapping Cypher types to GFQL types.
  3. Pydantic integration depth: Full Pydantic models (with validation, serialization) vs lightweight dataclasses with Pydantic-style annotations?
  4. Inference vs declaration: Should `infer_schema()` produce the same schema objects as manual declaration? Or separate "inferred" vs "declared" types?
  5. Incremental adoption: How to add schemas to existing untyped graphs without breaking anything?

Prior art

  • Neo4j constraints: `CREATE CONSTRAINT ... REQUIRE (n.prop) IS :: INTEGER` — runtime enforcement
  • Apache AGE: PostgreSQL-based, inherits PostgreSQL type system
  • Kuzu: Built-in schema with typed node/edge tables
  • Pydantic: Python schema validation library — potential building block
  • Apache Arrow Schema: Columnar type system — the target representation
  • GraphQL: Typed schema for API queries — similar schema-first philosophy
  • openCypher type system: `INTEGER`, `FLOAT`, `STRING`, `BOOLEAN`, `LIST`, `MAP`, `PATH`, `NODE`, `RELATIONSHIP`

Suggested approach

  1. Spike: Define `NodeType`, `EdgeType`, `GraphSchema` dataclasses. Implement `infer_schema()` from existing graph data.
  2. Validate: Add schema validation to the Cypher lowering path — check property references and topology constraints at compile time.
  3. Arrow: Map schema to `pyarrow.Schema`, add `to_arrow()` / `from_arrow()` with schema enforcement.
  4. Iterate: Pydantic integration, multi-label support, union types based on real usage patterns.

Relationship to existing code

  • `graphistry/compute/gfql/cypher/lowering.py`: Cypher property references validated here — schema validation plugs in naturally
  • `graphistry/compute/ast.py`: `ASTNode`, `ASTEdge` could carry schema type info
  • `graphistry/Engine.py`: Engine resolution (pandas/cudf) — Arrow bridge point
  • `graphistry/Plottable.py`: Schema could attach here as `._schema`
  • `label__X` convention: Existing multi-label encoding — schema inference reads these

AI contributor notes

  • Repo AI guidance: `AGENTS.md`, `ai/README.md`
  • GFQL architecture: `ai/docs/` has guides for the query pipeline
  • Test patterns: `graphistry/tests/compute/gfql/cypher/test_lowering.py` (600+ tests)
  • The Cypher compiler is pure Python (no pandas dependency) — schema validation can be added without runtime overhead
