GFQL schema: validate physical column concordance at schema ingest #1640

@lmeyerov

Summary

GraphSchema should validate physical-column concordance at schema specification ingest time, independent of whether the schema was hand-declared, imported from Arrow, or inferred from data.

PyGraphistry stores graph data as one node dataframe and one edge dataframe. Same-named properties across node labels share one physical node column; same-named properties across edge relationship types share one physical edge column.

Examples:

  • Cat.age and Dog.age both map to nodes["age"]
  • LIKES.weight and CHASES.weight both map to edges["weight"]
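A minimal sketch of that mapping, using plain Python dicts rather than any PyGraphistry API: label-scoped declarations are grouped by the physical column they land in. The `declared` dict and the `physical_columns` helper are hypothetical names used only for illustration.

```python
from collections import defaultdict

# Hypothetical per-label declarations: {label: {property: type name}}
declared = {
    "Cat": {"age": "int64", "name": "string"},
    "Dog": {"age": "int64"},
}


def physical_columns(declarations):
    """Group label-scoped properties by the physical column they share."""
    columns = defaultdict(dict)
    for label, props in declarations.items():
        for prop, type_name in props.items():
            # The property name alone selects the column; the label does not.
            columns[prop][label] = type_name
    return dict(columns)


columns = physical_columns(declared)
print(columns["age"])  # {'Cat': 'int64', 'Dog': 'int64'}
```

The same grouping applies to edges, with relationship types in place of labels.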

So a public schema object should not be allowed to carry incompatible same-column declarations such as Cat.age: int64 and Dog.age: string. That schema is internally contradictory for PyGraphistry's flattened table model and should fail or be explicitly resolved before callers reach schema.node_arrow(), schema.edge_arrow(), bind(schema=...), gfql_validate(...), or schema_validate=....

Current behavior

The conflict is caught too late. For example, current #1338 inference can produce contradictory fragments from label-filtered rows:

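import pandas as pd
import graphistry

# "age" holds an int for the Cat row and a string for the Dog row,
# yet both rows live in the same physical nodes["age"] column.
nodes = pd.DataFrame({
    "id": [1, 2],
    "age": [5, "old"],
    "label__Cat": [True, False],
    "label__Dog": [False, True],
})
g = graphistry.bind(node="id").nodes(nodes)
schema = graphistry.infer_schema(g)  # inference from #1338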

Current inferred fragments:

Cat.age -> int64
Dog.age -> string

A later schema.node_arrow() then raises a conflicting Arrow declaration error for the shared physical nodes["age"] column. That late failure is directionally correct, but schema ingest should make the GraphSchema well-formed earlier.

Goal

Add a GraphSchema well-formedness validation pass for physical-column concordance across all schema specification paths: hand-declared schemas, schemas imported from Arrow, and schemas inferred from data.

Proposed semantics

  1. A per-label/type schema may track which properties are admitted/observed for each label/type.
  2. Reused physical node columns must have one compatible logical/Arrow type across node labels.
  3. Reused physical edge columns must have one compatible logical/Arrow type across relationship types.
  4. Nullability and presence may remain type-local. A nullable/non-nullable difference is not the same as a physical type conflict.
  5. Compatible same-column types are accepted.
  6. Incompatible same-column types fail fast with a stable, actionable error naming:
    • table: nodes or edges
    • physical column name
    • participating labels/types
    • conflicting logical/Arrow types
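The semantics above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the GraphSchema implementation: `Field`, `ColumnConcordanceError`, and `check_concordance` are hypothetical names, and "compatible" is simplified to exact type-name equality, whereas a real check may admit a wider notion of compatibility.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    type_name: str         # logical/Arrow type name, e.g. "int64"
    nullable: bool = True  # type-local; ignored by the concordance check


class ColumnConcordanceError(ValueError):
    """Hypothetical error carrying the details the issue asks for."""

    def __init__(self, table, column, types_by_owner):
        self.table = table                          # "nodes" or "edges"
        self.column = column                        # physical column name
        self.types_by_owner = dict(types_by_owner)  # label/type -> type name
        detail = ", ".join(
            f"{owner}.{column}: {type_name}"
            for owner, type_name in sorted(self.types_by_owner.items())
        )
        super().__init__(f"{table}[{column!r}] has conflicting types ({detail})")


def check_concordance(table, declarations):
    """Raise if any physical column is declared with more than one type.

    declarations: {label_or_type: {column: Field}}
    """
    by_column = defaultdict(dict)
    for owner, fields in declarations.items():
        for column, field in fields.items():
            by_column[column][owner] = field.type_name
    for column, types_by_owner in sorted(by_column.items()):
        if len(set(types_by_owner.values())) > 1:
            raise ColumnConcordanceError(table, column, types_by_owner)


# Same type, different nullability: accepted.
check_concordance("edges", {
    "LIKES": {"weight": Field("float64", nullable=False)},
    "CHASES": {"weight": Field("float64")},
})

# Different types in one physical column: rejected.
try:
    check_concordance("nodes", {
        "Cat": {"age": Field("int64")},
        "Dog": {"age": Field("string")},
    })
except ColumnConcordanceError as err:
    print(err)  # nodes['age'] has conflicting types (Cat.age: int64, Dog.age: string)
```

Keeping `nullable` out of the comparison is what makes nullability type-local in this sketch; presence is type-local because a label that omits a column simply contributes nothing to that column's group.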

Acceptance

  • Declared schema construction rejects incompatible same-named node properties across labels.
  • Declared schema construction rejects incompatible same-named edge properties across relationship types.
  • GraphSchema.from_arrow(...) hits the same well-formedness checks.
  • infer_schema(...) does not return an internally contradictory GraphSchema; it either resolves through an explicit policy or surfaces a structured/reportable conflict.
  • Compatible same-named properties remain accepted.
  • Nullability differences remain type-local and do not produce false physical type conflicts.
  • Tests cover node and edge cases, including a Cat.age / Dog.age same-column conflict.
  • Docs explain that PyGraphistry's flattened node/edge tables require physical-column concordance even when per-label/type presence differs.
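For the inference criterion, one possible shape is a function that either applies a caller-supplied policy or returns a structured conflict report instead of a contradictory schema. The sketch below assumes hypothetical names (`infer_column_types`, `policy`) and treats types as plain strings; it does not describe how infer_schema(...) is or will be implemented.

```python
def infer_column_types(observed, policy=None):
    """Resolve one type per physical column, or report conflicts.

    observed: {column: {label_or_type: type_name}}
    policy:   optional callable (column, distinct_types) -> type_name
    Returns (resolved, conflicts).
    """
    resolved, conflicts = {}, []
    for column, types_by_owner in sorted(observed.items()):
        distinct = sorted(set(types_by_owner.values()))
        if len(distinct) == 1:
            resolved[column] = distinct[0]
        elif policy is not None:
            # Explicit resolution chosen by the caller.
            resolved[column] = policy(column, distinct)
        else:
            # No policy: surface a reportable conflict rather than guessing.
            conflicts.append({"column": column, "types": dict(types_by_owner)})
    return resolved, conflicts


observed = {
    "age": {"Cat": "int64", "Dog": "string"},
    "name": {"Cat": "string", "Dog": "string"},
}

# Without a policy, "age" is reported and left unresolved.
resolved, conflicts = infer_column_types(observed)
print(resolved, conflicts)

# With a hypothetical "fall back to string" policy, the conflict is resolved.
resolved_str, conflicts_str = infer_column_types(
    observed, policy=lambda column, types: "string"
)
print(resolved_str, conflicts_str)
```

Either way, the returned result never declares two types for one physical column.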

Coordination

Priority

Higher priority than graphistrygpt/Neptune/LLM reuse hooks. This is a core schema correctness invariant and should land before or directly after #1338.
