feat(python): add Python bindings for ggsql#74
Conversation
e30a71a to
874472b
Compare
Adds Python bindings for ggsql via PyO3/maturin, enabling Python users to render Vega-Lite visualizations from polars DataFrames. Features: - split_query(): Split a ggSQL query into SQL and VISUALISE parts - render(): Render a DataFrame with a VISUALISE spec to Vega-Lite JSON - Supports explicit, implicit, and wildcard mappings - Works with Python polars >= 1.0 Includes: - Unit tests and E2E tests with altair - GitHub Actions workflow for Python 3.10-3.14 - README with installation and development guide Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
874472b to
994062e
Compare
- Add default-members to Cargo.toml so cargo build/test work without requiring --exclude flags. ggsql-python requires Python dev headers and must be built separately with maturin. - Fix invalid import path in ggsql-python (ggsql::parser::ast doesn't exist, AestheticValue is exported at crate root) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove speculative <2.0 upper bound on maturin - Split extras into test, e2e, and dev for better granularity - Update CI workflow to use extras instead of hardcoded pip installs Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix wheel install using --find-links instead of glob patterns - Add Rust caching (Swatinem/rust-cache) to speed up builds - Enable sccache in maturin-action for compilation caching - Remove Python 3.14 from matrix (still experimental) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add _to_polars() helper to convert narwhals/pandas DataFrames to polars - Use GLOBAL_DATA_KEY constant instead of hardcoded "__global__" Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add narwhals dependency for handling IntoFrame types - Simplify render() to accept any narwhals-compatible DataFrame - Add tests for GLOBAL_DATA_KEY fix, narwhals, and pandas DataFrames - Fix test_split_query to test VISUALISE FROM behavior Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The maturin build outputs to the workspace root's target/wheels/ directory, not ggsql-python/target/wheels/. This was causing pip to fail finding the wheel. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
pyarrow is not required - pyo3-polars handles DataFrame conversion without Arrow FFI. Updated README to reflect narwhals support. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The pyo3-polars crate requires pyarrow for the Python-to-Rust DataFrame conversion. Without it, Series.to_arrow() fails with ModuleNotFoundError. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The generated parser.c and related files are required for the build. While Unix builds can regenerate these via tree-sitter-cli, Windows CI often fails due to missing npm/tree-sitter dependencies. Committing these files is the standard practice for tree-sitter grammars. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
| # Tree-sitter generated files | ||
| /tree-sitter-ggsql/src/parser.c | ||
| /tree-sitter-ggsql/src/tree_sitter/ | ||
| /tree-sitter-ggsql/src/node-types.json | ||
| /tree-sitter-ggsql/src/grammar.json | ||
|
|
There was a problem hiding this comment.
This change was needed since tree-sitter generate was failing on Windows. Claude offered a few different ways to fix this, but recommended this approach:
| Approach | Repo Size | Build Speed | Reliability | Complexity |
|---|---|---|---|---|
| Commit generated files | +1.3MB | Fast | High | Low |
| CI installs tree-sitter-cli | Small | Slow | Medium | Low |
| tree-sitter-cli crate | Small | Slow | High | Medium |
| Conditional build.rs | Small* | Fast* | Medium | Medium |
There was a problem hiding this comment.
If the issue is simply that tree sitter is not installed in CI, I'd rather not check in the generated files and instead install it on Windows using something like https://github.com/tree-sitter/setup-action/tree/master.
Those generated files can get very large as grammars get complex, and I'd like to avoid the noise if we can.
There was a problem hiding this comment.
I do recall having to do a little 'nvm' song and dance to please installation on windows, but I've forgotten the details already.
- Add altair as required dependency (moved from optional e2e) - render() now returns altair.Chart instead of JSON string - Add @overload decorator for future writer support - Update tests to work with Chart return type - Simplify CI workflow by removing separate e2e job - Update README with new API and usage examples Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Validates writer value in Python before calling into Rust, providing a clearer error message with supported options. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove test_altair_e2e.py (e2e naming no longer meaningful) - Consolidate all tests into test_ggsql.py - Focus tests on Python-specific logic: - DataFrame conversion via narwhals - Writer parameter validation - Return type handling - Remove tests that were testing Rust logic (chart structure, encodings, etc.) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename render() to render_altair() for clearer API - Remove writer parameter (always renders to Altair) - Add **kwargs pass-through to altair.Chart.from_json() - Update tests and README Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace pyo3-polars PyDataFrame with IPC byte serialization - Use polars native IPC reader/writer for Python-Rust data transfer - Remove pyarrow from dependencies (saves ~117MB install size) - Add proper Altair chart type detection (LayerChart, FacetChart, etc.) - Wheel size: 5.8MB → 6.1MB (but eliminates 117MB pyarrow dep) The IPC approach adds <2ms overhead for 1M rows, negligible compared to Altair's JSON parsing which dominates execution time. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add section 8 documenting ggsql-python package - Update feature flags to reflect Python is implemented (not planned) - Update feature dependencies mapping for ggsql-python Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
@georgestagg or @thomasp85 this is ready for a review whenever. BTW, If neither of you have interest/bandwidth in reviewing the Python package (which, at least for now, I'm primarily wanting for querychat), then I'd be happy to gain write access and be considered a "maintainer" of the Python package. I'll handle submitting to PyPI, etc. |
|
I am going to be very busy this week with other things, but I will take a look at this when I can. My apologies if it blocks the querychat work in the meantime. I would like to think carefully about the consequences of checking in the tree sitter artefacts, and whether the |
|
I'm afraid my knowledge and interest in Python is too limited to provide any sensible review |
Windows CI builds require tree-sitter-cli to generate parser files at build time. This adds the npm package installation to all workflows that build Rust code and removes the pre-generated parser files from version control. Changes: - Install tree-sitter-cli via npm in build, python, publish, and R workflows - Update build.rs to generate parser files during compilation - Remove pre-generated tree-sitter files from git (grammar.json, parser.c, etc.) - Add generated files to .gitignore Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
8cc8caa to
8ffdc1a
Compare
Doc tests were causing linker crashes (Bus error) on CI due to memory pressure when linking each doc test binary against the large dependency tree (polars, duckdb, etc.). Skip doc tests in CI since unit tests provide the real coverage. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
746d63b to
d0bb9fe
Compare
If you bang up against this again, I've used this little snippet in the past. A little cheeky, perhaps, but it seems to work. We could also try the with the posit-dev runners. |
There was a problem hiding this comment.
OK, this looks good to go in. The infrastructure looks good, thank you for making the Python build optional.
Now, the tricky bit. I am going to rework the Rust API, as discussed elsewhere, and so the Python package will need to change in response. I will handle that, but in a followup PR and tag you for review. I will try to keep the render_altair() interface as it currently is presented to the end user.
This PR introduces
ggsql-python, a new Python package providing PyO3 bindings for the ggsql crate. The package enables Python users to render Altair charts from DataFrames using ggsql's VISUALISE syntax.ggsql-pythonpackage with PyO3 bindings exposingsplit_query()andrender_altair()functionsrender_altair()returns analtair.Chartdirectly for easy display and customizationKey Features
API:
DataFrame Support:
Mapping Styles:
VISUALISE x AS x, y AS yVISUALISE x, y(column name matches aesthetic)VISUALISE *(map all matching columns)Building & Installation
Requirements:
From source (development mode):
Build a wheel:
Run tests:
Dependencies:
altair>=5.0- Chart renderingnarwhals>=2.15.0- DataFrame compatibility layerpolars>=1.0- Native DataFrame supportpyarrow>=14.0- Required by pyo3-polarsTest plan
split_query()correctly separates SQL and VISUALISE portionsrender_altair()accepts polars DataFramesrender_altair()accepts polars LazyFrames (auto-collected)render_altair()accepts narwhals DataFramesrender_altair()accepts pandas DataFrames (via narwhals)render_altair()rejects invalid DataFrame types withTypeErrorrender_altair()returnsaltair.Chartinstancerender_altair()passes**kwargsthrough toaltair.Chart.from_json()ValueError🤖 Generated with Claude Code