1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -9,6 +9,7 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
<!-- Do Not Erase This Section - Used for tracking unreleased changes -->

### Added
- **GFQL schema inference API (#1338)**: Added experimental `graphistry.infer_schema(g)`, `g.infer_schema()`, and `g.bind(infer_schema=True)` for opt-in public `GraphSchema` inference from bound local graph data. Inference derives node/edge property logical types, presence/nullability report details, `label__*` node and relationship labels, and source/destination topology when node label evidence is available. Inferred schemas carry descriptive `GraphSchema.metadata` provenance (`source="inferred"` or `source="mixed"`). Declared schemas remain explicit and take precedence when passed to `infer_schema(..., schema=...)`; `bind(schema=..., infer_schema=True)` is rejected instead of silently merging contracts.
- **GFQL NetworkX CALL parity (#1058)**: Expanded the local Cypher `graphistry.nx.*` CALL surface with explicit NetworkX dispatch for `degree_centrality`, `closeness_centrality`, `eigenvector_centrality`, `katz_centrality`, `connected_components`, `strongly_connected_components`, `core_number`, and multi-output `hits`, including row and `.write()` coverage.
- **NetworkX/SciPy optional dependency policy (#1618)**: Declared supported `networkx>=2.5,<4` and optional `scipy>=1.5,<2` ranges for NetworkX-backed GFQL CALL procedures, with runtime version guards and a focused lower/current-upper CI matrix.
- **GFQL schema Arrow boundary APIs (#1339)**: Added experimental public schema↔Arrow import/export helpers, graph-level Arrow declaration payloads, and opt-in `schema_validate='strict'|'autofix'` enforcement for `plot()`, `upload()`, `to_arrow()`, and `validate_arrow_schema()` when a `GraphSchema` is bound.
125 changes: 111 additions & 14 deletions docs/source/gfql/schema.rst
@@ -5,8 +5,9 @@ GFQL accepts public schema declarations through the stable
``graphistry.schema`` import path. Use this when application code owns a graph
contract and wants Cypher preflight checks to fail before query execution.
The API is experimental in this release: the import path and core declaration
objects are intended to be stable, while inference, coercion, remote transport,
and planner use are still follow-on surfaces.
objects are intended to be stable, while coercion, remote transport, and
planner use are still follow-on surfaces. Inference is also experimental and
must be requested explicitly.

The schema is optional. When you provide one, PyGraphistry uses it as the
declared contract for local GFQL validation. When you do not provide one,
@@ -95,6 +96,8 @@ Schema Objects
``GraphSchemaCatalog`` used by binder/preflight validation. ``strict=False``
makes schema-bound ``g.gfql_validate(...)`` permissive by default; callers can
still override per call with ``g.gfql_validate(..., strict=True)``.
``metadata`` is descriptive provenance for callers and exports; it is not part
of validation semantics.

``NodeType.to_arrow()`` and ``EdgeType.to_arrow()``
Export declarations as ``pyarrow.Schema`` objects through GFQL's row-schema
@@ -128,8 +131,8 @@ Invalid queries raise ``GFQLValidationError`` with structured context.
This is a correctness and documentation surface first: applications can state
what labels, relationship types, properties, and topology they expect, then
validate user-authored or generated Cypher before running it. The same typed
contract is also the foundation for later inference, coercion, remote transport,
and planner/performance work, but this page covers the declared local contract.
contract is also used by inference and is the foundation for later coercion,
remote transport, and planner/performance work.

Arrow Boundary Validation
-------------------------
@@ -163,22 +166,116 @@ boundaries. This is off by default so existing ``plot()``, ``upload()``, and
Provided vs. Inferred Schema
----------------------------

In this release, schemas are **provided**, not inferred. You create
``NodeType``, ``EdgeType``, and ``GraphSchema`` objects directly and attach them
with ``graphistry.bind(..., schema=schema)`` or ``g.bind(schema=schema)``.
You can provide a schema directly or infer one from bound local data.

Without an explicit ``GraphSchema``:
Use a provided schema when application code owns the contract:

* ``g.gfql_validate(...)`` can still use local dataframe columns already bound
on ``g._nodes`` and ``g._edges`` for schema-aware checks.
* It does not infer node types, edge types, Arrow dtypes, nullability, or
topology from data.
.. code-block:: python

    declared_g = (
        graphistry
        .edges(edges_df, "src", "dst")
        .nodes(nodes_df, "id")
        .bind(schema=schema)
    )

Use inference when the graph data should define the first draft contract:

.. code-block:: python

    inferred_base_g = graphistry.edges(edges_df, "src", "dst").nodes(nodes_df, "id")
    inferred_schema = inferred_base_g.infer_schema()
    inferred_g = inferred_base_g.bind(schema=inferred_schema)

For one-step local binding, use:

.. code-block:: python

    inferred_g = (
        graphistry
        .edges(edges_df, "src", "dst")
        .nodes(nodes_df, "id")
        .bind(infer_schema=True)
    )

Inference is opt-in. ``graphistry.bind(...)`` and ``g.bind(...)`` do not infer a
schema unless ``infer_schema=True`` is passed.
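
The opt-in rule can be pictured with a small stand-in that mirrors the guard this PR adds to ``PlotterBase.bind()``. ``SketchPlotter`` and ``fake_infer`` are hypothetical names used only for illustration; they are not part of PyGraphistry, and the real method handles many more bindings.

```python
from typing import Any, Optional


def fake_infer(plotter: "SketchPlotter") -> str:
    # Hypothetical placeholder for graphistry.schema_inference.infer_schema.
    return "inferred-schema"


class SketchPlotter:
    """Hypothetical stand-in that keeps only the schema slot of a plotter."""

    def __init__(self, gfql_schema: Optional[Any] = None) -> None:
        self._gfql_schema = gfql_schema

    def bind(self, schema: Optional[Any] = None, infer_schema: bool = False) -> "SketchPlotter":
        # Reject ambiguous contracts instead of silently merging them.
        if schema is not None and infer_schema:
            raise ValueError("schema and infer_schema cannot both be set")
        if infer_schema and self._gfql_schema is not None:
            raise ValueError("schema and infer_schema cannot both be set")
        # bind() returns a new object and leaves the old binding untouched.
        res = SketchPlotter(self._gfql_schema)
        if infer_schema:
            res._gfql_schema = fake_infer(res)
        elif schema is not None:
            res._gfql_schema = schema
        return res


plain = SketchPlotter().bind()                      # nothing is inferred
inferred = SketchPlotter().bind(infer_schema=True)  # inference was requested
```

In this sketch, a plain ``bind()`` leaves the schema slot empty, and inference only runs when the flag is passed.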

Inference Rules
---------------

``graphistry.infer_schema(g)`` and ``g.infer_schema()`` return a public
``GraphSchema``. They inspect currently bound ``nodes`` and ``edges`` dataframes:

* Node types come from boolean ``label__<Label>`` columns on the node table.
* Edge types come from boolean ``label__<TYPE>`` columns on the edge table.
* Node properties are non-label node columns observed on rows for a label.
* Edge properties are non-label edge columns, excluding the bound source,
  destination, and edge-id columns.
* Source/destination topology is inferred when edges reference bound node ids
  and those nodes have label columns. Edge-only graphs keep edge types and
  properties, but do not invent endpoint labels.
* A remote-only graph such as ``graphistry.bind(dataset_id="...")`` has no
  local dataframe columns, so local validation is limited to syntax, compile,
  and structural checks unless you also bind a declared schema.

Schema inference from existing plottables is tracked separately from this
declared-schema API.
Inference uses the same Arrow/GFQL row-schema bridge as declared schemas for
logical property types. The returned ``GraphSchema`` can be passed to
``g.bind(schema=schema)`` and used by ``g.gfql_validate(...)``.
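
As a rough illustration of the label and topology rules, the following plain-Python sketch walks rows represented as dicts. It is not the PyGraphistry implementation: ``labels_of``, ``sketch_node_types``, and ``sketch_topology`` are hypothetical helpers, ``None`` stands in for a null cell, and logical type inference is left out entirely.

```python
from typing import Any, Dict, List, Set, Tuple

LABEL_PREFIX = "label__"
Row = Dict[str, Any]


def labels_of(row: Row) -> List[str]:
    """Return label names whose boolean label__<Name> column is True."""
    return [k[len(LABEL_PREFIX):] for k, v in row.items()
            if k.startswith(LABEL_PREFIX) and v is True]


def sketch_node_types(nodes: List[Row]) -> Dict[str, Set[str]]:
    """Map each node label to the non-label columns with observed values."""
    out: Dict[str, Set[str]] = {}
    for row in nodes:
        props = {k for k, v in row.items()
                 if not k.startswith(LABEL_PREFIX) and v is not None}
        for label in labels_of(row):
            out.setdefault(label, set()).update(props)
    return out


def sketch_topology(nodes: List[Row], edges: List[Row], id_col: str,
                    src_col: str, dst_col: str) -> Dict[str, Set[Tuple[str, str]]]:
    """Map each edge type to the (source label, destination label) pairs seen."""
    labels_by_id = {row[id_col]: labels_of(row) for row in nodes}
    out: Dict[str, Set[Tuple[str, str]]] = {}
    for edge in edges:
        for edge_type in labels_of(edge):
            pairs = out.setdefault(edge_type, set())
            # Endpoints without label evidence contribute no pairs.
            for src_label in labels_by_id.get(edge[src_col], []):
                for dst_label in labels_by_id.get(edge[dst_col], []):
                    pairs.add((src_label, dst_label))
    return out


nodes = [
    {"id": "a", "label__Person": True, "label__Company": False, "age": 31, "ticker": None},
    {"id": "c", "label__Person": False, "label__Company": True, "age": None, "ticker": "ACME"},
]
edges = [{"src": "a", "dst": "c", "label__WORKS_AT": True, "since": 2020}]

node_types = sketch_node_types(nodes)
topology = sketch_topology(nodes, edges, "id", "src", "dst")
edge_only = sketch_topology([], edges, "id", "src", "dst")
```

With no node table, the sketch still reports the ``WORKS_AT`` edge type but leaves its endpoint pairs empty, which matches the rule that edge-only graphs do not invent endpoint labels.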

Inferred schemas include descriptive provenance:

.. code-block:: python

    inferred_schema.metadata["source"] == "inferred"

When declared definitions override inferred definitions through
``infer_schema(..., schema=schema)``, the returned schema uses
``metadata["source"] == "mixed"``. This metadata is informational; it does not
change preflight validation, Arrow validation, or schema equality.
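
The equality behavior follows from declaring the field with ``compare=False``, as the ``GraphSchema.metadata`` field in this PR does. A minimal standard-library sketch, where ``SketchSchema`` is a hypothetical, trimmed-down stand-in rather than the real class:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class SketchSchema:
    """Hypothetical stand-in that keeps only three GraphSchema-like fields."""

    node_types: Tuple[str, ...] = ()
    edge_types: Tuple[str, ...] = ()
    # compare=False keeps provenance out of the generated __eq__.
    metadata: Dict[str, Any] = field(default_factory=dict, compare=False)


inferred = SketchSchema(("Person",), ("KNOWS",), {"source": "inferred"})
mixed = SketchSchema(("Person",), ("KNOWS",), {"source": "mixed"})

# Only node_types and edge_types take part in the comparison.
same = inferred == mixed
```

Two schemas that differ only in provenance compare equal here, while a difference in node or edge types still makes them unequal.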

Presence And Nullability
------------------------

The public ``GraphSchema`` stores the inferred logical type and scalar
nullability needed by validation. For more detail, request the experimental
report:

.. code-block:: python

    schema, report = g.infer_schema(return_report=True)

The report tracks property presence separately from type:

``required``
    The property has observed values on every row for that node label or edge
    type.

``optional``
    The property has observed values on some rows and nulls on other rows for
    that node label or edge type.

``maybe_absent``
    The column exists on the dataframe but has no observed value for that node
    label or edge type. This commonly means another label/type uses the column.

``unknown``
    No rows were available for that node label or edge type.
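
The four presence states can be read as a small decision procedure over the values a property takes on the rows of one label or type. The helper below is a hypothetical sketch of that reading, with ``None`` standing in for a null cell; it is not the report's actual implementation.

```python
from typing import Any, Iterable


def classify_presence(values: Iterable[Any]) -> str:
    """Classify one property from the values on one label's (or type's) rows."""
    vals = list(values)
    if not vals:
        return "unknown"        # no rows for this label or type
    observed = sum(v is not None for v in vals)
    if observed == len(vals):
        return "required"       # a value on every row
    if observed == 0:
        return "maybe_absent"   # column exists, never populated here
    return "optional"           # some values, some nulls


age_on_person_rows = classify_presence([31, 45, None])
ticker_on_person_rows = classify_presence([None, None, None])
```

Under these assumptions, ``age`` on ``Person`` rows comes out as ``optional`` and ``ticker`` on ``Person`` rows as ``maybe_absent``, the case where another label uses the column.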

Declared Overrides
------------------

Declared schemas stay explicit. Passing ``schema=...`` to ``infer_schema()``
uses declared node and edge definitions in preference to inferred definitions
with the same names, while keeping inferred definitions for other names.

.. code-block:: python

    refined_schema = g.infer_schema(schema=schema)

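``g.bind(schema=..., infer_schema=True)`` is rejected with a ``ValueError``,
as is ``g.bind(infer_schema=True)`` on a graph that already has a bound schema.
Use either a provided schema or an inferred schema in a single bind call so the
validation contract is unambiguous.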
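
The precedence rule for ``infer_schema(..., schema=...)`` behaves like a name-keyed merge in which declared entries win. The sketch below uses plain dicts with string placeholders for definitions; ``merge_definitions`` is a hypothetical helper, and tagging the result ``"mixed"`` whenever any declared definition is supplied is an assumption about when that provenance value applies.

```python
from typing import Dict, Optional, Tuple


def merge_definitions(
    inferred: Dict[str, str],
    declared: Optional[Dict[str, str]] = None,
) -> Tuple[Dict[str, str], str]:
    """Merge definitions by name; declared entries replace inferred ones."""
    merged = dict(inferred)
    if declared:
        merged.update(declared)  # same name: the declared definition wins
        return merged, "mixed"   # assumption: any declared input yields "mixed"
    return merged, "inferred"


merged, source = merge_definitions(
    {"Person": "inferred Person", "Company": "inferred Company"},
    {"Person": "declared Person"},
)
```

Here ``Person`` takes the declared definition while ``Company`` keeps the inferred one.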

Local vs. Remote GFQL
---------------------
4 changes: 4 additions & 0 deletions graphistry/Plottable.py
@@ -364,9 +364,13 @@ def bind(
nodes_file_id: Optional[str] = None,
edges_file_id: Optional[str] = None,
schema: Optional[Any] = None,
infer_schema: Any = False,
) -> 'Plottable':
...

def infer_schema(self, *, schema: Optional[Any] = None, return_report: bool = False) -> Any:
...

def copy(self) -> 'Plottable':
...

21 changes: 20 additions & 1 deletion graphistry/PlotterBase.py
@@ -1622,6 +1622,7 @@ def bind(self,
nodes_file_id: Optional[str] = None,
edges_file_id: Optional[str] = None,
schema: Optional[Any] = None,
infer_schema: Any = False,
) -> Plottable:
"""Relate data attributes to graph structure and visual representation. To facilitate reuse and replayable notebooks, the binding call is chainable. Invocation does not affect the old binding: it instead returns a new Plotter instance with the new bindings added to the existing ones. Both the old and new bindings can then be used for different graphs.

@@ -1694,6 +1695,9 @@ def bind(self,
:param schema: Optional experimental public GFQL schema declaration from ``graphistry.schema``.
:type schema: Optional[Any]

:param infer_schema: Infer an experimental public GFQL schema from currently bound data and attach it.
:type infer_schema: bool

:returns: Plotter
:rtype: Plotter

@@ -1773,7 +1777,16 @@ def bind(self,
res._url = url or self._url
res._nodes_file_id = nodes_file_id or self._nodes_file_id
res._edges_file_id = edges_file_id or self._edges_file_id
res._gfql_schema = schema if schema is not None else self._gfql_schema
if schema is not None and infer_schema:
raise ValueError("schema and infer_schema cannot both be set")
if infer_schema and self._gfql_schema is not None:
raise ValueError("schema and infer_schema cannot both be set")
if infer_schema:
from graphistry.schema_inference import infer_schema as _infer_schema

res._gfql_schema = _infer_schema(res)
else:
res._gfql_schema = schema if schema is not None else self._gfql_schema

# Invalidate dataset_id if we're changing encodings, not setting IDs
encoding_params_changed = any([
@@ -1792,6 +1805,12 @@ def bind(self,
def copy(self) -> Plottable:
return copy.copy(self)

def infer_schema(self, *, schema: Optional[Any] = None, return_report: bool = False) -> Any:
"""Infer an experimental public GFQL schema from currently bound data."""
from graphistry.schema_inference import infer_schema

return infer_schema(self, schema=schema, return_report=return_report)


def nodes(self, nodes: Union[Callable, Any], node=None, *args, **kwargs) -> Plottable:
"""Specify the set of nodes and associated data.
7 changes: 7 additions & 0 deletions graphistry/__init__.py
@@ -42,6 +42,7 @@
encode_point_badge,
encode_edge_badge,
apply_encodings,
infer_schema,
hypergraph,
bolt,
cypher,
@@ -140,6 +141,12 @@
NodeType,
)

from graphistry.schema_inference import (
InferredProperty,
PresenceState,
SchemaInferenceReport,
)

from graphistry.privacy import (
Mode, Privacy
)
14 changes: 13 additions & 1 deletion graphistry/pygraphistry.py
@@ -1925,12 +1925,14 @@ def bind(
nodes_file_id: Optional[str] = None,
edges_file_id: Optional[str] = None,
schema: Optional[Any] = None,
infer_schema: Any = False,
) -> Plotter:
"""Create a base plotter.

Typically called at start of a program. For parameters, see ``plotter.bind()`` .
The ``schema`` parameter accepts the experimental public GFQL schema
declarations from ``graphistry.schema``.
declarations from ``graphistry.schema``. ``infer_schema=True`` infers
that schema from currently bound local data.

:returns: Plotter
:rtype: Plotter
@@ -1974,8 +1976,17 @@ def bind(
nodes_file_id=nodes_file_id,
edges_file_id=edges_file_id,
schema=schema,
infer_schema=infer_schema,
))

def infer_schema(self, g: Optional[Any] = None, *, schema: Optional[Any] = None, return_report: bool = False) -> Any:
"""Infer an experimental public GFQL schema from a plotter."""
from graphistry.schema_inference import infer_schema

if g is None:
raise ValueError("graphistry.infer_schema(g) requires a plotter; use g.infer_schema() for bound graphs")
return infer_schema(g, schema=schema, return_report=return_report)

def from_dataset_id(self, dataset_id: str, api_token: Optional[str] = None) -> Plotter:
"""Fetch existing remote dataset metadata and hydrate a Plotter.

@@ -2763,6 +2774,7 @@ def _handle_api_response(self, response):
encode_point_badge = PyGraphistry.encode_point_badge
encode_edge_badge = PyGraphistry.encode_edge_badge
apply_encodings = PyGraphistry.apply_encodings
infer_schema = PyGraphistry.infer_schema
infer_labels = PyGraphistry.infer_labels
name = PyGraphistry.name
description = PyGraphistry.description
8 changes: 8 additions & 0 deletions graphistry/schema.py
@@ -331,6 +331,7 @@ class GraphSchema:
node_id_column: Optional[str] = None
edge_source_column: Optional[str] = None
edge_destination_column: Optional[str] = None
metadata: Mapping[str, Any] = field(default_factory=dict, compare=False)

def __init__(
self,
@@ -341,13 +342,15 @@ def __init__(
node_id_column: Optional[str] = None,
edge_source_column: Optional[str] = None,
edge_destination_column: Optional[str] = None,
metadata: Optional[Mapping[str, Any]] = None,
) -> None:
object.__setattr__(self, "node_types", tuple(node_types))
object.__setattr__(self, "edge_types", tuple(edge_types))
object.__setattr__(self, "strict", bool(strict))
object.__setattr__(self, "node_id_column", node_id_column)
object.__setattr__(self, "edge_source_column", edge_source_column)
object.__setattr__(self, "edge_destination_column", edge_destination_column)
object.__setattr__(self, "metadata", dict(metadata or {}))

@property
def node_columns(self) -> FrozenSet[str]:
@@ -441,6 +444,7 @@ def to_arrow(
"node_id_column": self.node_id_column,
"edge_source_column": self.edge_source_column,
"edge_destination_column": self.edge_destination_column,
"metadata": dict(self.metadata),
}

@classmethod
@@ -454,6 +458,7 @@ def from_arrow(
node_id_column: Optional[str] = None,
edge_source_column: Optional[str] = None,
edge_destination_column: Optional[str] = None,
metadata: Optional[Mapping[str, Any]] = None,
coercion: CoercionMode = "widen",
) -> "GraphSchema":
"""Import graph schema declarations from Arrow schemas.
@@ -471,6 +476,7 @@ def from_arrow(
Optional[str],
declaration.get("edge_destination_column", edge_destination_column),
)
metadata = cast(Optional[Mapping[str, Any]], declaration.get("metadata", metadata))

nodes = tuple(
NodeType.from_arrow(name, schema, coercion=coercion)
@@ -487,6 +493,7 @@ def from_arrow(
node_id_column=node_id_column,
edge_source_column=edge_source_column,
edge_destination_column=edge_destination_column,
metadata=metadata,
)

def to_catalog(
@@ -514,6 +521,7 @@ def to_catalog(
edge_type.name: edge_type.to_row_schema(include_type_label=False)
for edge_type in self.edge_types
},
"schema_metadata": dict(self.metadata),
}
return GraphSchemaCatalog.from_schema_parts(
node_columns=self.node_columns,