chore: centralize default LLM model + harden benchmark exception handling#8
Conversation
…ling Two follow-ups intentionally deferred from PR #6: 1. **Centralize `DEFAULT_MODEL`** as the single source of truth across the Copilot-backed LLM modules. Previously every module duplicated the literal `"gpt-5.2"` default, which made silent drift possible (one module could lag behind a future model bump). - `featcopilot/utils/__init__.py`: re-export `DEFAULT_MODEL`. - `featcopilot/llm/copilot_client.py`: `CopilotConfig.model` & `CopilotFeatureClient.__init__` now default to `DEFAULT_MODEL`. - `featcopilot/llm/code_generator.py`, `featcopilot/llm/explainer.py`, `featcopilot/llm/transform_rule_generator.py`, `featcopilot/llm/semantic_engine.py` (config + `__init__`): same treatment. - `featcopilot/transformers/sklearn_compat.py`: `AutoFeatureEngineer._create_engine("llm")` falls back to `DEFAULT_MODEL` instead of a hardcoded literal. - Benchmark runners (`simple_models`, `automl`, `compare_tools`) pull the same constant in their `llm_config` builders. Behaviour unchanged: `DEFAULT_MODEL == "gpt-5.2"` still. OpenAI / LiteLLM clients intentionally keep `"gpt-4o"` (they use a different backend with its own public model identifier). 2. **Harden benchmark exception handling** in `benchmarks/simple_models/run_simple_models_benchmark.py`: - Replace the bare `except Exception` around per-fold FeatCopilot fit/transform with a narrowed `_EXPECTED_FE_FAILURES` list (`ValueError, KeyError, TypeError, RuntimeError, MemoryError, LinAlgError`) plus a separate `except Exception` branch that logs the full traceback via `logger.exception`. Genuine bugs (`AttributeError`, `ImportError`, ...) no longer hide behind a silent baseline-fallback. - Record per-fold failures in a new `fe_failed_folds` / `n_fe_failed_folds` field on the results dict so consumers can see when the `tabular_best_score` is a fair comparison vs a fallback-padded one. - Replace `print(...)` with `logger.warning` / `logger.exception` so failures show up in log aggregators / CI summaries. - Wilcoxon `except ValueError` now logs (was silently setting `p_value = 1.0`). - Top-level dataset-loop `except Exception` switched to `logger.exception` (was `print + traceback.print_exc`). - `_resolve_source` narrowed from `except Exception` to `(KeyError, ValueError, TypeError)`. Tests: - `tests/test_utils.py`: 3 new tests pinning `DEFAULT_MODEL` is re-exported from `featcopilot.utils`, every Copilot-backed LLM module / config defaults to `DEFAULT_MODEL`, and `AutoFeatureEngineer.llm_config` falls back to `DEFAULT_MODEL`. - `tests/test_benchmark_encoding.py`: 4 new tests pinning the narrowed exception list, propagation of unexpected errors out of `_resolve_source`, the centralized model in `llm_config`, and the module-owned `logger` attribute. Full pytest: 924 passed, 2 skipped. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #8 +/- ##
==========================================
+ Coverage 89.32% 89.34% +0.01%
==========================================
Files 38 38
Lines 4358 4364 +6
==========================================
+ Hits 3893 3899 +6
Misses 465 465
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR refactors Copilot-backed LLM components to use a single centralized DEFAULT_MODEL constant (reducing drift risk) and hardens benchmark exception handling so failures are observable while keeping benchmark runs resilient.
Changes:
- Re-export
DEFAULT_MODELfromfeatcopilot.utilsand replace duplicated"gpt-5.2"literals across Copilot-backed LLM modules, sklearn adapter fallback paths, and benchmark runners. - Improve
benchmarks/simple_models/run_simple_models_benchmark.pyerror handling by distinguishing expected vs unexpected per-fold FE failures, logging stack traces for unexpected errors, and recording per-fold failure metadata in results. - Add regression tests covering default-model centralization and benchmark exception-handling behavior/logging.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
featcopilot/utils/__init__.py |
Re-exports DEFAULT_MODEL to make it available from the public featcopilot.utils namespace. |
featcopilot/llm/copilot_client.py |
Switches Copilot client config/init defaults to centralized DEFAULT_MODEL. |
featcopilot/llm/code_generator.py |
Uses DEFAULT_MODEL as the default model for the code generator. |
featcopilot/llm/explainer.py |
Uses DEFAULT_MODEL as the default model for the explainer. |
featcopilot/llm/semantic_engine.py |
Updates Pydantic config + engine init default to DEFAULT_MODEL. |
featcopilot/llm/transform_rule_generator.py |
Uses DEFAULT_MODEL as the default model and updates docstring accordingly. |
featcopilot/transformers/sklearn_compat.py |
Replaces the sklearn adapter LLM fallback model literal with DEFAULT_MODEL. |
benchmarks/simple_models/run_simple_models_benchmark.py |
Centralizes benchmark LLM config model and adds structured/logged exception handling with failure tracking. |
benchmarks/automl/run_automl_benchmark.py |
Uses DEFAULT_MODEL in benchmark LLM config builder. |
benchmarks/compare_tools/run_fe_tools_comparison.py |
Uses DEFAULT_MODEL in benchmark LLM config builder. |
tests/test_utils.py |
Adds tests pinning the DEFAULT_MODEL re-export and ensuring consistent LLM-module defaults. |
tests/test_benchmark_encoding.py |
Adds tests pinning narrowed exception swallowing, expected-failure types, centralized model usage, and module logger presence. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The previous comment claimed the stdlib `logging` choice was about 'staying usable before featcopilot is installed' — but this PR added a direct `from featcopilot.utils.models import DEFAULT_MODEL` right above the comment, which contradicts that claim. The actual motivation is that `featcopilot.utils.logger` sets `propagate=False` on the `featcopilot.*` logger tree (see `featcopilot/utils/logger.py:20`), which would block benchmark output from reaching root-logger handlers configured by downstream consumers (CI runners, log aggregators, `pytest --log-cli`). Updated the comment to reflect that real motivation; no behavior change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
@copilot @codex please review the whole PR (not just the latest commit). Round-1 fix (e14a078) addressed the Copilot reviewer thread on All checks green; please look at the full PR scope:
|
Full PR review complete — here's my assessment of all the changes: DEFAULT_MODEL centralization (all Copilot-backed LLM modules + benchmarks)
Benchmark exception hardening (run_simple_models_benchmark.py)
Tests
All CI checks are green. No issues found — the changes are clean, well-scoped, and the tests provide solid regression coverage. |
|
Codex Review: Didn't find any major issues. Chef's kiss. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
To use Codex here, create a Codex account and connect to github. |
Summary
Two follow-ups intentionally deferred from PR #6 (where I'd noted them as
"speculative
gpt-5.2default-model concerns" and "benchmarkexception-swallowing — better in a dedicated hardening PR"). Branched
from main after fast-forwarding to the latest (post-#7) tip.
1. Centralize
DEFAULT_MODELas the single source of truthProblem. Every Copilot-backed LLM module duplicated the literal
"gpt-5.2"string as its default model.DEFAULT_MODELalreadyexisted in
featcopilot/utils/models.pybut wasn't actually beingused as the source of truth. This made silent drift possible — a
future model bump in one module could lag the others by months.
Fix. Re-export
DEFAULT_MODELfrom the publicfeatcopilot.utilsnamespace and use it everywhere instead of theliteral:
featcopilot/llm/copilot_client.pyCopilotConfig.model,CopilotFeatureClient.__init__featcopilot/llm/code_generator.pyFeatureCodeGenerator.__init__featcopilot/llm/explainer.pyFeatureExplainer.__init__featcopilot/llm/transform_rule_generator.pyTransformRuleGenerator.__init__featcopilot/llm/semantic_engine.pySemanticEngineConfig.model,SemanticEngine.__init__featcopilot/transformers/sklearn_compat.pyAutoFeatureEngineer._create_engine("llm")fallbackbenchmarks/simple_models/run_simple_models_benchmark.pyget_featcopilot_enginesbenchmarks/automl/run_automl_benchmark.pyget_featcopilot_enginesbenchmarks/compare_tools/run_fe_tools_comparison.pyllm_configbuilderBehaviour unchanged —
DEFAULT_MODEL == "gpt-5.2"still. Thisis purely a refactor that eliminates drift risk; a future model change
becomes a single-line edit.
Intentionally NOT changed:
featcopilot/llm/openai_client.pyandlitellm_client.pykeep"gpt-4o"— those backends use a different public modelidentifier and that's by design.
README.md/docs/keep theliteral
"gpt-5.2"string for readability (showing the actualstring a user would copy/paste is more useful than
DEFAULT_MODEL).2. Harden benchmark exception handling
Problem.
benchmarks/simple_models/run_simple_models_benchmark.pyhad three
except Exceptionblocks that silently swallowed errors:score, hiding genuine bugs (a refactor regression raising
AttributeErrorwould look identical to a benign validationfailure).
print + traceback.print_exc, nolog-aggregator visibility.
_resolve_source→ bareexcept Exceptionmasking unexpectederrors as benign "synthetic" bucketing.
except ValueErroraroundscipy.stats.wilcoxonsilently setp_value = 1.0(looked like a real null result).Fix.
except _EXPECTED_FE_FAILURES(recoverableValueError, KeyError, TypeError, RuntimeError, MemoryError, LinAlgError→logger.warning+ baseline fallback) andexcept Exception(unexpected →logger.exceptionwith fulltraceback + baseline fallback so a single bad fold doesn't poison
the dataset run).
fe_failed_folds(list of{seed, fold, error_type, error_message, expected}dicts) andn_fe_failed_foldsto theresults dict so consumers can see when
tabular_best_scoreis afair comparison vs a fallback-padded one.
WARNING(still falls back top_value = 1.0— that's the right user-facing behaviour).logger.exceptioninstead ofprint + traceback.print_exc._resolve_sourcenarrowed to(KeyError, ValueError, TypeError).Tests added
tests/test_utils.py— 3 new tests:test_default_model_exposed_from_utils_package— pins there-export.
test_llm_modules_share_default_model— pins every Copilot-backedLLM module/config defaults to
DEFAULT_MODEL.test_auto_feature_engineer_llm_default_model— pinsAutoFeatureEngineerfalls back toDEFAULT_MODELwhenllm_configlacks the"model"key.tests/test_benchmark_encoding.py— 4 new tests:test_resolve_source_only_swallows_expected_exceptions— provesAttributeErrorpropagates (was previously silently swallowed).test_expected_fe_failures_includes_recoverable_types— pins theexpected-error list (and proves
AttributeError,NameError,ImportErrorare NOT in it).test_benchmark_module_uses_centralized_default_model— pins thellm_configmodel =DEFAULT_MODEL.test_benchmark_logger_is_configured— pins the module-ownedlogger(provesprintwas replaced).Validation
pre-commit run --all-files: clean (black, ruff).