feat: expand python-agent-driver to 102 pip packages #80
Pull request overview
This PR expands the python-agent-driver image to include a much larger set of preinstalled Python packages, updates warmup pre-imports to reduce per-run import overhead for a few additional modules, and increases the VM heap/memory defaults to accommodate the larger rootfs. It also adds documentation enumerating supported packages and their import names.
Changes:
- Increased sandbox heap sizing used by `pyhl` install/setup and `pydriver-run` to 2.5 GiB.
- Expanded the pre-import warmup module list (Rust help + guest warmup loop) and broadened the shipped package set in the driver rootfs Docker build.
- Added `docs/python-packages.md` to document supported packages and import names.
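The warmup pre-import idea can be illustrated in plain Python: once a module has been imported, later `import` statements are served from the `sys.modules` cache. The sketch below is an assumption-laden illustration, not the PR's Rust or C warmup code; `warmup` and `timed_import` are hypothetical helpers, and stdlib modules stand in for the real list so it runs anywhere.

```python
import importlib
import sys
import time


def warmup(module_names):
    """Import each module once so later imports hit the sys.modules cache."""
    loaded = []
    for name in module_names:
        try:
            importlib.import_module(name)
            loaded.append(name)
        except ImportError:
            # Skip modules that are not installed in this environment.
            pass
    return loaded


def timed_import(name):
    """Return the wall-clock seconds spent importing `name`."""
    start = time.perf_counter()
    importlib.import_module(name)
    return time.perf_counter() - start


# Stand-in list (assumption): the PR's real list lives in the pyhl and
# hl_pydriver.c sources and is not reproduced here.
loaded = warmup(["json", "csv", "not_a_real_module_xyz"])
print(loaded)
print("json" in sys.modules)
```

After `warmup` has run, `timed_import("json")` only measures a cache lookup, which is the effect the warmup list aims for inside the snapshot.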
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| host/src/pyhl.rs | Increases heap size used during install warmup/snapshot creation. |
| host/src/bin/pyhl.rs | Expands preimport list, updates --help text, and increases heap size during pyhl setup. |
| host/src/bin/pydriver_run.rs | Aligns heap size with the new 2.5 GiB default for running the driver. |
| examples/python-agent-driver/Justfile | Updates the example memory setting value. |
| examples/python-agent-driver/hl_pydriver.c | Expands guest-side warmup pre-import list to match pyhl’s list. |
| examples/python-agent-driver/Dockerfile | Installs a significantly larger set of Python packages into the rootfs. |
| docs/python-packages.md | Adds documentation listing supported packages and import names. |
@@ -0,0 +1,150 @@
+# Python package support
+
+The python-agent-driver ships CPython 3.12 with the full standard library and 103 third-party pip packages.
Fixed — updated wording to 'explicitly installed top-level packages'.
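Since the new documentation pairs package names with import names, a small Python sketch of that mapping may help. It is an illustration under stated assumptions: `IMPORT_NAME_OVERRIDES` and `import_name` are hypothetical names, the table lists only a few well-known cases, and the fallback rule (lowercase, hyphens to underscores) is a heuristic rather than anything the PR defines.

```python
# Distribution names (as passed to pip) whose import name differs from
# the simple "lowercase, hyphens to underscores" rule.
# Assumption: a partial, hand-written table, not the PR's documentation.
IMPORT_NAME_OVERRIDES = {
    "python-docx": "docx",
    "python-pptx": "pptx",
    "scikit-learn": "sklearn",
    "beautifulsoup4": "bs4",
    "pyyaml": "yaml",
    "pillow": "PIL",
    "python-dateutil": "dateutil",
    "python-dotenv": "dotenv",
    "markdown-it-py": "markdown_it",
    "PyPDF2": "PyPDF2",  # import name keeps its capitalisation
}


def import_name(distribution):
    """Best-effort guess of the module name to import for a pip package."""
    if distribution in IMPORT_NAME_OVERRIDES:
        return IMPORT_NAME_OVERRIDES[distribution]
    # Heuristic fallback; it will be wrong for packages not listed above
    # that use an unrelated import name.
    return distribution.lower().replace("-", "_")


for dist in ["python-docx", "scikit-learn", "Faker", "pytest-cov", "duckdb"]:
    print(f"{dist} -> {import_name(dist)}")
```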
Additional packages shipped in the rootfs (pay import cost on \
first use): aiohttp, altair, APScheduler, bandit, bokeh, \
boto3, builtwith, celery, coverage, distro, docx2txt, duckdb, \
edgartools, exchange-calendars, fabric, Faker, fastapi, \
feedparser, fpdf2, gensim, gitpython, \
google-api-python-client, hypercorn, hypothesis, loguru, \
markdown, markdownify, mutagen, networkx, nltk, \
numpy-financial, odfpy, paramiko, pdfplumber, pdfrw, pexpect, \
pipdeptree, platformdirs, plotly, polars, praw, pycountry, \
pydub, pyflakes, pygments, pylint, PyPDF2, pytest, \
pytest-asyncio, pytest-cov, pyxlsb, qrcode, radon, rapidfuzz, \
rarfile, reportlab, rope, ruff, schedule, scikit-learn, scipy, \
scrapy, send2trash, slack-sdk, srt, statsmodels, svgwrite, \
sympy, textblob, trafilatura, tweepy, typer, uvicorn, vulture, \
watchdog, websockets, wordcloud, xlrd, xlsxwriter.\n\n\
Acknowledged — deduplicating CLI help text into a shared source is a good idea but out of scope for this PR.
 FROM python:3.12-slim AS deps
 RUN pip install --target=/deps --no-cache-dir \
 tqdm pyyaml jinja2 beautifulsoup4 tabulate click tenacity \
 python-dotenv pypdf openpyxl markdown-it-py pydantic pillow \
-lxml cryptography python-dateutil numpy pandas
+lxml cryptography python-dateutil numpy pandas \
+python-docx python-pptx \
+chardet charset-normalizer \
+requests httpx aiohttp \
+feedparser markdown markdownify \
+Faker pycountry \
+loguru schedule send2trash \
+duckdb polars \
+xlrd xlsxwriter pyxlsb odfpy \
+pdfplumber pdfrw PyPDF2 \
+qrcode svgwrite \
+rapidfuzz \
+networkx sympy \
+pydub srt mutagen \
+plotly altair bokeh \
+statsmodels scikit-learn scipy \
+wordcloud \
+nltk textblob gensim \
Fixed — added --prefer-binary to avoid sdist fallbacks.
 kernel := ".unikraft/build/python-agent-driver-hyperlight_hyperlight-x86_64"
 initrd := "python-agent-driver-initrd.cpio"
-memory := "1Gi"
+memory := "2560Mi"
Pre-existing — the memory variable was already unused before this PR. Leaving as-is for now.
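For reference, the new `2560Mi` value is the same quantity as the 2.5 GiB heap mentioned in the overview. The Python sketch below checks that arithmetic; `parse_quantity` is a hypothetical helper written for this note and only understands the `Mi` and `Gi` suffixes seen in the Justfile.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def parse_quantity(text):
    """Convert a string such as '2560Mi' or '1Gi' to a byte count.

    Hypothetical helper: it handles only the two binary suffixes that
    appear in the Justfile hunk.
    """
    units = {"Mi": MIB, "Gi": GIB}
    number, suffix = text[:-2], text[-2:]
    if suffix not in units:
        raise ValueError(f"unsupported unit in {text!r}")
    return int(number) * units[suffix]


new_bytes = parse_quantity("2560Mi")
old_bytes = parse_quantity("1Gi")

print(new_bytes)          # byte count of the new setting
print(new_bytes / GIB)    # the same value expressed in GiB
print(new_bytes / old_bytes)
```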
99cb3b7 to 31f2a74
Linux Benchmarks
| Benchmark suite | Current: fd6265c | Previous: 7162172 | Ratio |
|---|---|---|---|
| hello_world (median) | 20 ms | 20 ms | 1 |
| pandas (median) | 110 ms | 100 ms | 1.10 |
| density (per VM) | 11 MB | 7 MB | 1.57 |
| snapshot (disk) | 656 MiB | 385 MiB | 1.70 |
This comment was automatically generated by workflow using github-action-benchmark.
- Add 88 new pip packages to Dockerfile (removed edgartools due to pyarrow pulling in concurrent.futures.thread which crashes Unikraft)
- Bump heap to 2.5 GiB across all entry points to accommodate the larger rootfs
- Pre-import docx and pptx during warmup for zero-cost access
- Add pydoc stub for pyarrow compatibility
- Update --help to list all shipped packages
- Add docs/python-packages.md reference

Signed-off-by: danbugs <danilochiarlone@gmail.com>
- Add smoke test step before benchmarks to catch pyhl run failures early
- Add explicit pandas smoke check before timing loop
- Redirect stdout to /dev/null in timing runs to isolate timing output

Signed-off-by: danbugs <danilochiarlone@gmail.com>
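The benchmark changes in this commit (smoke check first, then timing runs with stdout discarded) follow a pattern that can be sketched in Python. This is not the workflow's actual script, which is not shown on this page; `smoke_check` and `median_runtime_ms` are hypothetical helpers, and the current Python interpreter stands in for `pyhl run` so the sketch runs anywhere.

```python
import statistics
import subprocess
import sys
import time


def smoke_check(cmd):
    """Run the command once; raise CalledProcessError if it exits non-zero."""
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)


def median_runtime_ms(cmd, runs=5):
    """Smoke-check the command, then return the median wall time in ms."""
    smoke_check(cmd)  # fail early, before any timing is recorded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # stdout is discarded so program output does not mix with timings.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in command (assumption): the real benchmark would invoke pyhl run.
cmd = [sys.executable, "-c", "print('hello')"]
result = median_runtime_ms(cmd, runs=3)
print(f"median: {result:.1f} ms")
```

Running the smoke check outside the timing loop means a broken setup surfaces as an immediate error instead of a misleading timing sample.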
6dbc7a8 to fd6265c
Windows Benchmarks
| Benchmark suite | Current: fd6265c | Previous: 7162172 | Ratio |
|---|---|---|---|
| hello_world (median) | 352 ms | 234 ms | 1.50 |
| pandas (median) | 1075 ms | 699 ms | 1.54 |
| density (per VM) | 10 MB | 6 MB | 1.67 |
| snapshot (disk) | 663 MiB | 392 MiB | 1.69 |
This comment was automatically generated by workflow using github-action-benchmark.
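The Ratio column in the benchmark tables appears to be current divided by previous, rounded to two decimals. The Python sketch below recomputes it from the reported values as a sanity check; it is an independent recomputation, not the code github-action-benchmark runs.

```python
def ratio(current, previous):
    """Current value divided by previous value, rounded to two decimals."""
    return round(current / previous, 2)


# (current, previous) pairs copied from the Linux and Windows tables.
linux = {
    "hello_world (median)": (20, 20),
    "pandas (median)": (110, 100),
    "density (per VM)": (11, 7),
    "snapshot (disk)": (656, 385),
}
windows = {
    "hello_world (median)": (352, 234),
    "pandas (median)": (1075, 699),
    "density (per VM)": (10, 6),
    "snapshot (disk)": (663, 392),
}

for platform, rows in (("Linux", linux), ("Windows", windows)):
    for name, (current, previous) in rows.items():
        print(f"{platform:8} {name:22} {ratio(current, previous):.2f}")
```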
Summary
- Removed `edgartools` (its transitive dep `pyarrow` pulls in `concurrent.futures.thread`, which crashes Unikraft)
- Added `docx` and `pptx` to the pre-import warmup list (zero import cost at runtime)
- Added `docs/python-packages.md` listing all supported packages with import names

Test plan
- `pyhl setup --force` completes without crash
- `pyhl run -c "import pandas; print(pandas.DataFrame({'a':[1,2,3]}).describe())"` works
- `import duckdb`, `import polars`, `import sklearn`
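The import checks in the test plan could be bundled into one script and passed to the sandbox. A minimal Python sketch follows, assuming a hypothetical `check_imports` helper; stdlib modules are used as stand-ins so it runs outside the driver image, where the list would instead name `pandas`, `duckdb`, `polars`, and `sklearn`.

```python
import importlib


def check_imports(names):
    """Try to import each module; return {name: error text} for failures."""
    failures = {}
    for name in names:
        try:
            importlib.import_module(name)
        except Exception as exc:  # report any import-time error, not only ImportError
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


# Stand-in module list (assumption) so the sketch runs in any environment.
failures = check_imports(["json", "csv", "not_a_real_module_xyz"])
for name, error in failures.items():
    print(f"FAILED {name}: {error}")
```

Collecting every failure before reporting gives a complete picture in one run, which is useful when many packages are added at once.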