Summary
GraphSchema should validate physical-column concordance at schema specification ingest time, independent of whether the schema was hand-declared, imported from Arrow, or inferred from data.
PyGraphistry stores graph data as one node dataframe and one edge dataframe. Same-named properties across node labels share one physical node column; same-named properties across edge relationship types share one physical edge column.
Examples:
- Cat.age and Dog.age both map to nodes["age"]
- LIKES.weight and CHASES.weight both map to edges["weight"]
So a public schema object should not be allowed to carry incompatible same-column declarations such as Cat.age: int64 and Dog.age: string. That schema is internally contradictory for PyGraphistry's flattened table model and should fail or be explicitly resolved before callers reach schema.node_arrow(), schema.edge_arrow(), bind(schema=...), gfql_validate(...), or schema_validate=....
Current behavior
The conflict is caught too late. For example, current #1338 inference can produce contradictory fragments from label-filtered rows:
import pandas as pd
import graphistry

nodes = pd.DataFrame({
    "id": [1, 2],
    "age": [5, "old"],  # mixed int/str values in one physical column
    "label__Cat": [True, False],
    "label__Dog": [False, True],
})
g = graphistry.bind(node="id").nodes(nodes)
schema = graphistry.infer_schema(g)
Current inferred fragments:
Cat.age -> int64
Dog.age -> string
A later schema.node_arrow() then raises a conflicting Arrow declaration error for the shared physical nodes["age"] column. That late failure is directionally correct, but schema ingest should make the GraphSchema well-formed earlier.
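Why label-filtered inference produces these fragments can be illustrated with a minimal standard-library sketch. It is not the #1338 implementation: plain row dicts stand in for the node dataframe, `infer_fragments` is a hypothetical helper, and Python type names stand in for Arrow types.

```python
# Rows stand in for the flattened node dataframe from the example above.
rows = [
    {"id": 1, "age": 5, "label__Cat": True, "label__Dog": False},
    {"id": 2, "age": "old", "label__Cat": False, "label__Dog": True},
]

def infer_fragments(rows, column, label_prefix="label__"):
    """Collect the value types seen for `column`, separately per label (hypothetical helper)."""
    fragments = {}
    for row in rows:
        for key, flag in row.items():
            # Only rows carrying the label contribute to that label's fragment.
            if key.startswith(label_prefix) and flag:
                label = key[len(label_prefix):]
                fragments.setdefault(label, set()).add(type(row[column]).__name__)
    return fragments

per_label = infer_fragments(rows, "age")

# The same physical column, viewed without label filtering.
whole_column = {type(row["age"]).__name__ for row in rows}
```

Each label's fragment looks clean in isolation (`Cat` sees only `int`, `Dog` sees only `str`), while the shared column holds both types. The contradiction is only visible when fragments are compared per physical column.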
Goal
Add a GraphSchema well-formedness validation pass for physical-column concordance across all schema specification paths:
- GraphSchema(node_types=..., edge_types=...)
- GraphSchema.from_arrow(...)
- infer_schema(...)
Proposed semantics
- Per-label/type schema may track which properties are admitted/observed for each label/type.
- Reused physical node columns must have one compatible logical/Arrow type across node labels.
- Reused physical edge columns must have one compatible logical/Arrow type across relationship types.
- Nullability and presence may remain type-local. A nullable/non-nullable difference is not the same as a physical type conflict.
- Compatible same-column types are accepted.
- Incompatible same-column types fail fast with a stable, actionable error naming:
  - table: nodes or edges
  - physical column name
  - participating labels/types
  - conflicting logical/Arrow types
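The semantics above can be sketched as a small fail-fast check. This is an illustration under stated assumptions, not the proposed GraphSchema API: `PhysicalColumnConflict` and `check_concordance` are hypothetical names, declarations are plain dicts, type names are strings, and "compatible" is simplified to exact equality (a real implementation might also accept safe widening).

```python
from collections import defaultdict

class PhysicalColumnConflict(ValueError):
    """Hypothetical error naming table, column, owners, and conflicting types."""

    def __init__(self, table, column, types_by_owner):
        self.table = table
        self.column = column
        self.types_by_owner = dict(types_by_owner)
        detail = ", ".join(
            f"{owner}.{column}: {type_name}"
            for owner, type_name in sorted(self.types_by_owner.items())
        )
        super().__init__(
            f"Conflicting types for physical column {table}[{column!r}]: {detail}"
        )

def check_concordance(table, declarations):
    """declarations: {label_or_type: {property: (type_name, nullable)}}."""
    by_column = defaultdict(dict)
    for owner, props in declarations.items():
        for prop, (type_name, _nullable) in props.items():
            # Nullability stays type-local, so only the type name is compared.
            by_column[prop][owner] = type_name
    for column, owners in sorted(by_column.items()):
        # Assumption: compatibility means identical type names.
        if len(set(owners.values())) > 1:
            raise PhysicalColumnConflict(table, column, owners)

node_decls = {
    "Cat": {"age": ("int64", False)},
    "Dog": {"age": ("int64", True)},  # nullability differs: still concordant
}
check_concordance("nodes", node_decls)  # passes silently

bad_decls = {
    "Cat": {"age": ("int64", False)},
    "Dog": {"age": ("string", False)},
}
try:
    check_concordance("nodes", bad_decls)
except PhysicalColumnConflict as err:
    message = str(err)
```

Sorting the owners before formatting keeps the message stable regardless of declaration order, which matters if callers or tests match on it.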
Acceptance
- Declared schema construction rejects incompatible same-named node properties across labels.
- Declared schema construction rejects incompatible same-named edge properties across relationship types.
- GraphSchema.from_arrow(...) hits the same well-formedness checks.
- infer_schema(...) does not return an internally contradictory GraphSchema; it either resolves through an explicit policy or surfaces a structured/reportable conflict.
- Compatible same-named properties remain accepted.
- Nullability differences remain type-local and do not produce false physical type conflicts.
- Tests cover node and edge cases, including a Cat.age / Dog.age same-column conflict.
- Docs explain that PyGraphistry's flattened node/edge tables require physical-column concordance even when per-label/type presence differs.
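For the inference path, one way to "surface a structured/reportable conflict" instead of raising is to collect conflict records. The sketch below is only an assumption about what such a report could look like: `ColumnConflict` and `collect_conflicts` are hypothetical names, and fragments are plain dicts of type-name strings.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ColumnConflict:
    """Hypothetical report record for one conflicting physical column."""
    table: str
    column: str
    types_by_owner: tuple  # sorted (label_or_type, type_name) pairs

def collect_conflicts(table, fragments):
    """fragments: {label_or_type: {property: type_name}}; returns a list of conflicts."""
    seen = {}
    for owner, props in fragments.items():
        for prop, type_name in props.items():
            seen.setdefault(prop, {})[owner] = type_name
    return [
        ColumnConflict(table, column, tuple(sorted(owners.items())))
        for column, owners in sorted(seen.items())
        if len(set(owners.values())) > 1  # assumption: compatible means identical
    ]

node_report = collect_conflicts("nodes", {
    "Cat": {"age": "int64", "name": "string"},
    "Dog": {"age": "string", "name": "string"},
})
edge_report = collect_conflicts("edges", {
    "LIKES": {"weight": "float64"},
    "CHASES": {"weight": "float64"},
})
```

Here `node_report` holds a single record for `age`, and `edge_report` is empty because both relationship types agree on `weight`. A caller could then choose a resolution policy or fail with the full list.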
Coordination
Priority
Higher priority than graphistrygpt/Neptune/LLM reuse hooks. This is a core schema correctness invariant and should land before or directly after #1338.