chore: centralize default LLM model + harden benchmark exception handling by thinkall · Pull Request #8 · thinkall/featcopilot

thinkall · 2026-05-06T10:13:31Z

Summary

Two follow-ups intentionally deferred from PR #6 (where I'd noted them as
"speculative gpt-5.2 default-model concerns" and "benchmark
exception-swallowing — better in a dedicated hardening PR"). Branched
from main after fast-forwarding to the latest (post-#7) tip.

1. Centralize `DEFAULT_MODEL` as the single source of truth

Problem. Every Copilot-backed LLM module duplicated the literal
"gpt-5.2" string as its default model. DEFAULT_MODEL already
existed in featcopilot/utils/models.py but wasn't actually being
used as the source of truth. This made silent drift possible — a
future model bump in one module could lag the others by months.

Fix. Re-export DEFAULT_MODEL from the public
featcopilot.utils namespace and use it everywhere instead of the
literal:

Module	Where the default lives
`featcopilot/llm/copilot_client.py`	`CopilotConfig.model`, `CopilotFeatureClient.__init__`
`featcopilot/llm/code_generator.py`	`FeatureCodeGenerator.__init__`
`featcopilot/llm/explainer.py`	`FeatureExplainer.__init__`
`featcopilot/llm/transform_rule_generator.py`	`TransformRuleGenerator.__init__`
`featcopilot/llm/semantic_engine.py`	`SemanticEngineConfig.model`, `SemanticEngine.__init__`
`featcopilot/transformers/sklearn_compat.py`	`AutoFeatureEngineer._create_engine("llm")` fallback
`benchmarks/simple_models/run_simple_models_benchmark.py`	`get_featcopilot_engines`
`benchmarks/automl/run_automl_benchmark.py`	`get_featcopilot_engines`
`benchmarks/compare_tools/run_fe_tools_comparison.py`	`llm_config` builder

Behaviour unchanged — DEFAULT_MODEL == "gpt-5.2" still. This
is purely a refactor that eliminates drift risk; a future model change
becomes a single-line edit.

Intentionally NOT changed:

featcopilot/llm/openai_client.py and litellm_client.py keep
"gpt-4o" — those backends use a different public model
identifier and that's by design.
User-facing doc examples in README.md / docs/ keep the
literal "gpt-5.2" string for readability (showing the actual
string a user would copy/paste is more useful than DEFAULT_MODEL).

2. Harden benchmark exception handling

Problem. benchmarks/simple_models/run_simple_models_benchmark.py
had three except Exception blocks that silently swallowed errors:

Per-fold FeatCopilot fit/transform → silently fell back to baseline
score, hiding genuine bugs (a refactor regression raising
AttributeError would look identical to a benign validation
failure).
Top-level dataset run → print + traceback.print_exc, no
log-aggregator visibility.
_resolve_source → bare except Exception masking unexpected
errors as benign "synthetic" bucketing.
except ValueError around scipy.stats.wilcoxon silently set
p_value = 1.0 (looked like a real null result).

Fix.

Per-fold FE: split into except _EXPECTED_FE_FAILURES (recoverable
ValueError, KeyError, TypeError, RuntimeError, MemoryError, LinAlgError → logger.warning + baseline fallback) and
except Exception (unexpected → logger.exception with full
traceback + baseline fallback so a single bad fold doesn't poison
the dataset run).
Add fe_failed_folds (list of {seed, fold, error_type, error_message, expected} dicts) and n_fe_failed_folds to the
results dict so consumers can see when tabular_best_score is a
fair comparison vs a fallback-padded one.
Wilcoxon failure now logs at WARNING (still falls back to
p_value = 1.0 — that's the right user-facing behaviour).
Top-level dataset wrapper uses logger.exception instead of
print + traceback.print_exc.
_resolve_source narrowed to (KeyError, ValueError, TypeError).

Tests added

tests/test_utils.py — 3 new tests:

test_default_model_exposed_from_utils_package — pins the
re-export.
test_llm_modules_share_default_model — pins every Copilot-backed
LLM module/config defaults to DEFAULT_MODEL.
test_auto_feature_engineer_llm_default_model — pins
AutoFeatureEngineer falls back to DEFAULT_MODEL when
llm_config lacks the "model" key.

tests/test_benchmark_encoding.py — 4 new tests:

test_resolve_source_only_swallows_expected_exceptions — proves
AttributeError propagates (was previously silently swallowed).
test_expected_fe_failures_includes_recoverable_types — pins the
expected-error list (and proves AttributeError, NameError,
ImportError are NOT in it).
test_benchmark_module_uses_centralized_default_model — pins the
llm_config model = DEFAULT_MODEL.
test_benchmark_logger_is_configured — pins the module-owned
logger (proves print was replaced).

Validation

pre-commit run --all-files: clean (black, ruff).
Full pytest: 924 passed, 2 skipped in 5m28s.

…ling Two follow-ups intentionally deferred from PR #6: 1. **Centralize `DEFAULT_MODEL`** as the single source of truth across the Copilot-backed LLM modules. Previously every module duplicated the literal `"gpt-5.2"` default, which made silent drift possible (one module could lag behind a future model bump). - `featcopilot/utils/__init__.py`: re-export `DEFAULT_MODEL`. - `featcopilot/llm/copilot_client.py`: `CopilotConfig.model` & `CopilotFeatureClient.__init__` now default to `DEFAULT_MODEL`. - `featcopilot/llm/code_generator.py`, `featcopilot/llm/explainer.py`, `featcopilot/llm/transform_rule_generator.py`, `featcopilot/llm/semantic_engine.py` (config + `__init__`): same treatment. - `featcopilot/transformers/sklearn_compat.py`: `AutoFeatureEngineer._create_engine("llm")` falls back to `DEFAULT_MODEL` instead of a hardcoded literal. - Benchmark runners (`simple_models`, `automl`, `compare_tools`) pull the same constant in their `llm_config` builders. Behaviour unchanged: `DEFAULT_MODEL == "gpt-5.2"` still. OpenAI / LiteLLM clients intentionally keep `"gpt-4o"` (they use a different backend with its own public model identifier). 2. **Harden benchmark exception handling** in `benchmarks/simple_models/run_simple_models_benchmark.py`: - Replace the bare `except Exception` around per-fold FeatCopilot fit/transform with a narrowed `_EXPECTED_FE_FAILURES` list (`ValueError, KeyError, TypeError, RuntimeError, MemoryError, LinAlgError`) plus a separate `except Exception` branch that logs the full traceback via `logger.exception`. Genuine bugs (`AttributeError`, `ImportError`, ...) no longer hide behind a silent baseline-fallback. - Record per-fold failures in a new `fe_failed_folds` / `n_fe_failed_folds` field on the results dict so consumers can see when the `tabular_best_score` is a fair comparison vs a fallback-padded one. - Replace `print(...)` with `logger.warning` / `logger.exception` so failures show up in log aggregators / CI summaries. - Wilcoxon `except ValueError` now logs (was silently setting `p_value = 1.0`). - Top-level dataset-loop `except Exception` switched to `logger.exception` (was `print + traceback.print_exc`). - `_resolve_source` narrowed from `except Exception` to `(KeyError, ValueError, TypeError)`. Tests: - `tests/test_utils.py`: 3 new tests pinning `DEFAULT_MODEL` is re-exported from `featcopilot.utils`, every Copilot-backed LLM module / config defaults to `DEFAULT_MODEL`, and `AutoFeatureEngineer.llm_config` falls back to `DEFAULT_MODEL`. - `tests/test_benchmark_encoding.py`: 4 new tests pinning the narrowed exception list, propagation of unexpected errors out of `_resolve_source`, the centralized model in `llm_config`, and the module-owned `logger` attribute. Full pytest: 924 passed, 2 skipped. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

codecov · 2026-05-06T10:15:56Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.34%. Comparing base (4a8caa2) to head (e14a078).

Additional details and impacted files

@@            Coverage Diff             @@
##             main       #8      +/-   ##
==========================================
+ Coverage   89.32%   89.34%   +0.01%     
==========================================
  Files          38       38              
  Lines        4358     4364       +6     
==========================================
+ Hits         3893     3899       +6     
  Misses        465      465

Flag	Coverage Δ
unittests	`89.34% <100.00%> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Pull request overview

This PR refactors Copilot-backed LLM components to use a single centralized DEFAULT_MODEL constant (reducing drift risk) and hardens benchmark exception handling so failures are observable while keeping benchmark runs resilient.

Changes:

Re-export DEFAULT_MODEL from featcopilot.utils and replace duplicated "gpt-5.2" literals across Copilot-backed LLM modules, sklearn adapter fallback paths, and benchmark runners.
Improve benchmarks/simple_models/run_simple_models_benchmark.py error handling by distinguishing expected vs unexpected per-fold FE failures, logging stack traces for unexpected errors, and recording per-fold failure metadata in results.
Add regression tests covering default-model centralization and benchmark exception-handling behavior/logging.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
`featcopilot/utils/__init__.py`	Re-exports `DEFAULT_MODEL` to make it available from the public `featcopilot.utils` namespace.
`featcopilot/llm/copilot_client.py`	Switches Copilot client config/init defaults to centralized `DEFAULT_MODEL`.
`featcopilot/llm/code_generator.py`	Uses `DEFAULT_MODEL` as the default model for the code generator.
`featcopilot/llm/explainer.py`	Uses `DEFAULT_MODEL` as the default model for the explainer.
`featcopilot/llm/semantic_engine.py`	Updates Pydantic config + engine init default to `DEFAULT_MODEL`.
`featcopilot/llm/transform_rule_generator.py`	Uses `DEFAULT_MODEL` as the default model and updates docstring accordingly.
`featcopilot/transformers/sklearn_compat.py`	Replaces the sklearn adapter LLM fallback model literal with `DEFAULT_MODEL`.
`benchmarks/simple_models/run_simple_models_benchmark.py`	Centralizes benchmark LLM config model and adds structured/logged exception handling with failure tracking.
`benchmarks/automl/run_automl_benchmark.py`	Uses `DEFAULT_MODEL` in benchmark LLM config builder.
`benchmarks/compare_tools/run_fe_tools_comparison.py`	Uses `DEFAULT_MODEL` in benchmark LLM config builder.
`tests/test_utils.py`	Adds tests pinning the `DEFAULT_MODEL` re-export and ensuring consistent LLM-module defaults.
`tests/test_benchmark_encoding.py`	Adds tests pinning narrowed exception swallowing, expected-failure types, centralized model usage, and module logger presence.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

The previous comment claimed the stdlib `logging` choice was about 'staying usable before featcopilot is installed' — but this PR added a direct `from featcopilot.utils.models import DEFAULT_MODEL` right above the comment, which contradicts that claim. The actual motivation is that `featcopilot.utils.logger` sets `propagate=False` on the `featcopilot.*` logger tree (see `featcopilot/utils/logger.py:20`), which would block benchmark output from reaching root-logger handlers configured by downstream consumers (CI runners, log aggregators, `pytest --log-cli`). Updated the comment to reflect that real motivation; no behavior change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

thinkall · 2026-05-06T11:14:47Z

@copilot @codex please review the whole PR (not just the latest commit).

Round-1 fix (e14a078) addressed the Copilot reviewer thread on
benchmarks/simple_models/run_simple_models_benchmark.py: the
comment justifying logging.getLogger(__name__) previously claimed
this kept the module usable before featcopilot was installed, which
contradicted the new from featcopilot.utils.models import DEFAULT_MODEL line right above it. The comment is now grounded in
the real motivation: featcopilot/utils/logger.py:20 sets
propagate=False, which would block benchmark log output from
reaching root-logger handlers configured by downstream consumers
(CI / log aggregators / pytest --log-cli).

All checks green; please look at the full PR scope:

centralization of DEFAULT_MODEL across 5 LLM modules + sklearn
adapter + 3 benchmark runners
per-fold FE exception narrowing (_EXPECTED_FE_FAILURES vs
logger.exception for unexpected bugs)
fe_failed_folds / n_fe_failed_folds results-dict additions
Wilcoxon failure logging
7 new regression tests

Copilot · 2026-05-06T11:17:57Z

@copilot @codex please review the whole PR (not just the latest commit).

Round-1 fix (e14a078) addressed the Copilot reviewer thread on...

Full PR review complete — here's my assessment of all the changes:

DEFAULT_MODEL centralization (all Copilot-backed LLM modules + benchmarks)

featcopilot/utils/__init__.py — DEFAULT_MODEL correctly added to both the import and __all__, so from featcopilot.utils import DEFAULT_MODEL works cleanly.
copilot_client.py, code_generator.py, explainer.py, transform_rule_generator.py, semantic_engine.py — each imports DEFAULT_MODEL from featcopilot.utils.models and uses it as the parameter default. Docstrings updated to reflect the dynamic default.
sklearn_compat.py — AutoFeatureEngineer._create_engine("llm") fallback uses DEFAULT_MODEL.
All three benchmark runners — get_featcopilot_engines and llm_config builder use DEFAULT_MODEL. openai_client.py and litellm_client.py intentionally keep "gpt-4o" (correct, those are a different backend).

Benchmark exception hardening (run_simple_models_benchmark.py)

_EXPECTED_FE_FAILURES tuple is well-scoped — covers ValueError, KeyError, TypeError, RuntimeError, MemoryError, LinAlgError (all sklearn/pandas/featcopilot validation paths) and deliberately excludes AttributeError, NameError, ImportError (genuine bugs that should surface).
Per-fold split: _EXPECTED_FE_FAILURES → logger.warning + baseline fallback; Exception → logger.exception + baseline fallback — correct pattern. Both branches record into fe_failed_folds with "expected": True/False so downstream consumers can distinguish a "fair" comparison from a fallback-padded one.
n_fe_failed_folds and fe_failed_folds added to the results dict — useful for report consumers.
_resolve_source narrowed to (KeyError, ValueError, TypeError) — matches the documented exception surface of is_real_world.
Wilcoxon ValueError now logs at WARNING with the dataset name and count — still falls back to p_value=1.0 (correct), but no longer looks like a genuine null result.
Top-level except Exception uses logger.exception — no more print + traceback.print_exc.
Stdlib logging.getLogger(__name__) usage with the updated comment (e14a078) is correct: the real motivation is propagate=False on the featcopilot.* tree blocking CI/log-aggregator output.

Tests

test_default_model_exposed_from_utils_package — pins the re-export.
test_llm_modules_share_default_model — covers all 6 Copilot-backed modules/configs end-to-end.
test_auto_feature_engineer_llm_default_model — verifies the sklearn_compat fallback path.
test_resolve_source_only_swallows_expected_exceptions — proves AttributeError now propagates.
test_expected_fe_failures_includes_recoverable_types — pins the _EXPECTED_FE_FAILURES contract.
test_benchmark_module_uses_centralized_default_model — end-to-end check via get_featcopilot_engines.
test_benchmark_logger_is_configured — pins the module-owned logger attribute.

All CI checks are green. No issues found — the changes are clean, well-scoped, and the tests provide solid regression coverage.

chatgpt-codex-connector · 2026-05-06T11:18:01Z

Codex Review: Didn't find any major issues. Chef's kiss.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-06T11:18:08Z

To use Codex here, create a Codex account and connect to github.

Copilot AI review requested due to automatic review settings May 6, 2026 10:13

Copilot started reviewing on behalf of thinkall May 6, 2026 10:14 View session

Copilot AI reviewed May 6, 2026

View reviewed changes

Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated

Copilot started work on behalf of thinkall May 6, 2026 11:14 View session

Copilot finished work on behalf of thinkall May 6, 2026 11:19

jianglibigdata approved these changes May 6, 2026

View reviewed changes

jianglibigdata merged commit bf11981 into main May 6, 2026
8 checks passed

jianglibigdata deleted the chore/centralize-default-model-and-benchmark-hardening branch May 6, 2026 12:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore: centralize default LLM model + harden benchmark exception handling#8

chore: centralize default LLM model + harden benchmark exception handling#8
jianglibigdata merged 2 commits into
mainfrom
chore/centralize-default-model-and-benchmark-hardening

thinkall commented May 6, 2026

Uh oh!

codecov Bot commented May 6, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

thinkall commented May 6, 2026

Uh oh!

Copilot AI commented May 6, 2026

Uh oh!

chatgpt-codex-connector Bot commented May 6, 2026

Uh oh!

chatgpt-codex-connector Bot commented May 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

thinkall commented May 6, 2026

Summary

1. Centralize DEFAULT_MODEL as the single source of truth

2. Harden benchmark exception handling

Tests added

Validation

Uh oh!

codecov Bot commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

thinkall commented May 6, 2026

Uh oh!

Copilot AI commented May 6, 2026

Uh oh!

chatgpt-codex-connector Bot commented May 6, 2026

Uh oh!

chatgpt-codex-connector Bot commented May 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

1. Centralize `DEFAULT_MODEL` as the single source of truth

codecov Bot commented May 6, 2026 •

edited

Loading