Skip to content

chore: centralize default LLM model + harden benchmark exception handling#8

Merged
jianglibigdata merged 2 commits into
mainfrom
chore/centralize-default-model-and-benchmark-hardening
May 6, 2026
Merged

chore: centralize default LLM model + harden benchmark exception handling#8
jianglibigdata merged 2 commits into
mainfrom
chore/centralize-default-model-and-benchmark-hardening

Conversation

@thinkall
Copy link
Copy Markdown
Owner

@thinkall thinkall commented May 6, 2026

Summary

Two follow-ups intentionally deferred from PR #6 (where I'd noted them as
"speculative gpt-5.2 default-model concerns" and "benchmark
exception-swallowing — better in a dedicated hardening PR"). Branched
from main after fast-forwarding to the latest (post-#7) tip.

1. Centralize DEFAULT_MODEL as the single source of truth

Problem. Every Copilot-backed LLM module duplicated the literal
"gpt-5.2" string as its default model. DEFAULT_MODEL already
existed in featcopilot/utils/models.py but wasn't actually being
used as the source of truth. This made silent drift possible — a
future model bump in one module could lag the others by months.

Fix. Re-export DEFAULT_MODEL from the public
featcopilot.utils namespace and use it everywhere instead of the
literal:

Module Where the default lives
featcopilot/llm/copilot_client.py CopilotConfig.model, CopilotFeatureClient.__init__
featcopilot/llm/code_generator.py FeatureCodeGenerator.__init__
featcopilot/llm/explainer.py FeatureExplainer.__init__
featcopilot/llm/transform_rule_generator.py TransformRuleGenerator.__init__
featcopilot/llm/semantic_engine.py SemanticEngineConfig.model, SemanticEngine.__init__
featcopilot/transformers/sklearn_compat.py AutoFeatureEngineer._create_engine("llm") fallback
benchmarks/simple_models/run_simple_models_benchmark.py get_featcopilot_engines
benchmarks/automl/run_automl_benchmark.py get_featcopilot_engines
benchmarks/compare_tools/run_fe_tools_comparison.py llm_config builder

Behaviour unchangedDEFAULT_MODEL == "gpt-5.2" still. This
is purely a refactor that eliminates drift risk; a future model change
becomes a single-line edit.

Intentionally NOT changed:

  • featcopilot/llm/openai_client.py and litellm_client.py keep
    "gpt-4o" — those backends use a different public model
    identifier and that's by design.
  • User-facing doc examples in README.md / docs/ keep the
    literal "gpt-5.2" string for readability (showing the actual
    string a user would copy/paste is more useful than DEFAULT_MODEL).

2. Harden benchmark exception handling

Problem. benchmarks/simple_models/run_simple_models_benchmark.py
had three except Exception blocks that silently swallowed errors:

  • Per-fold FeatCopilot fit/transform → silently fell back to baseline
    score, hiding genuine bugs (a refactor regression raising
    AttributeError would look identical to a benign validation
    failure).
  • Top-level dataset run → print + traceback.print_exc, no
    log-aggregator visibility.
  • _resolve_source → bare except Exception masking unexpected
    errors as benign "synthetic" bucketing.
  • except ValueError around scipy.stats.wilcoxon silently set
    p_value = 1.0 (looked like a real null result).

Fix.

  • Per-fold FE: split into except _EXPECTED_FE_FAILURES (recoverable
    ValueError, KeyError, TypeError, RuntimeError, MemoryError, LinAlgErrorlogger.warning + baseline fallback) and
    except Exception (unexpected → logger.exception with full
    traceback + baseline fallback so a single bad fold doesn't poison
    the dataset run).
  • Add fe_failed_folds (list of {seed, fold, error_type, error_message, expected} dicts) and n_fe_failed_folds to the
    results dict so consumers can see when tabular_best_score is a
    fair comparison vs a fallback-padded one.
  • Wilcoxon failure now logs at WARNING (still falls back to
    p_value = 1.0 — that's the right user-facing behaviour).
  • Top-level dataset wrapper uses logger.exception instead of
    print + traceback.print_exc.
  • _resolve_source narrowed to (KeyError, ValueError, TypeError).

Tests added

tests/test_utils.py — 3 new tests:

  • test_default_model_exposed_from_utils_package — pins the
    re-export.
  • test_llm_modules_share_default_model — pins every Copilot-backed
    LLM module/config defaults to DEFAULT_MODEL.
  • test_auto_feature_engineer_llm_default_model — pins
    AutoFeatureEngineer falls back to DEFAULT_MODEL when
    llm_config lacks the "model" key.

tests/test_benchmark_encoding.py — 4 new tests:

  • test_resolve_source_only_swallows_expected_exceptions — proves
    AttributeError propagates (was previously silently swallowed).
  • test_expected_fe_failures_includes_recoverable_types — pins the
    expected-error list (and proves AttributeError, NameError,
    ImportError are NOT in it).
  • test_benchmark_module_uses_centralized_default_model — pins the
    llm_config model = DEFAULT_MODEL.
  • test_benchmark_logger_is_configured — pins the module-owned
    logger (proves print was replaced).

Validation

  • pre-commit run --all-files: clean (black, ruff).
  • Full pytest: 924 passed, 2 skipped in 5m28s.

…ling

Two follow-ups intentionally deferred from PR #6:

1. **Centralize `DEFAULT_MODEL`** as the single source of truth across
   the Copilot-backed LLM modules. Previously every module duplicated the
   literal `"gpt-5.2"` default, which made silent drift possible (one
   module could lag behind a future model bump).

   - `featcopilot/utils/__init__.py`: re-export `DEFAULT_MODEL`.
   - `featcopilot/llm/copilot_client.py`: `CopilotConfig.model` &
     `CopilotFeatureClient.__init__` now default to `DEFAULT_MODEL`.
   - `featcopilot/llm/code_generator.py`,
     `featcopilot/llm/explainer.py`,
     `featcopilot/llm/transform_rule_generator.py`,
     `featcopilot/llm/semantic_engine.py` (config + `__init__`):
     same treatment.
   - `featcopilot/transformers/sklearn_compat.py`:
     `AutoFeatureEngineer._create_engine("llm")` falls back to
     `DEFAULT_MODEL` instead of a hardcoded literal.
   - Benchmark runners (`simple_models`, `automl`, `compare_tools`)
     pull the same constant in their `llm_config` builders.

   Behaviour unchanged: `DEFAULT_MODEL == "gpt-5.2"` still.
   OpenAI / LiteLLM clients intentionally keep `"gpt-4o"` (they use a
   different backend with its own public model identifier).

2. **Harden benchmark exception handling** in
   `benchmarks/simple_models/run_simple_models_benchmark.py`:

   - Replace the bare `except Exception` around per-fold FeatCopilot
     fit/transform with a narrowed `_EXPECTED_FE_FAILURES` list
     (`ValueError, KeyError, TypeError, RuntimeError, MemoryError,
     LinAlgError`) plus a separate `except Exception` branch that
     logs the full traceback via `logger.exception`. Genuine bugs
     (`AttributeError`, `ImportError`, ...) no longer hide behind a
     silent baseline-fallback.
   - Record per-fold failures in a new `fe_failed_folds` /
     `n_fe_failed_folds` field on the results dict so consumers can
     see when the `tabular_best_score` is a fair comparison vs a
     fallback-padded one.
   - Replace `print(...)` with `logger.warning` / `logger.exception`
     so failures show up in log aggregators / CI summaries.
   - Wilcoxon `except ValueError` now logs (was silently setting
     `p_value = 1.0`).
   - Top-level dataset-loop `except Exception` switched to
     `logger.exception` (was `print + traceback.print_exc`).
   - `_resolve_source` narrowed from `except Exception` to
     `(KeyError, ValueError, TypeError)`.

Tests:
- `tests/test_utils.py`: 3 new tests pinning `DEFAULT_MODEL` is
  re-exported from `featcopilot.utils`, every Copilot-backed LLM
  module / config defaults to `DEFAULT_MODEL`, and
  `AutoFeatureEngineer.llm_config` falls back to `DEFAULT_MODEL`.
- `tests/test_benchmark_encoding.py`: 4 new tests pinning the
  narrowed exception list, propagation of unexpected errors out of
  `_resolve_source`, the centralized model in `llm_config`, and the
  module-owned `logger` attribute.

Full pytest: 924 passed, 2 skipped.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 6, 2026 10:13
@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.34%. Comparing base (4a8caa2) to head (e14a078).

Additional details and impacted files
@@            Coverage Diff             @@
##             main       #8      +/-   ##
==========================================
+ Coverage   89.32%   89.34%   +0.01%     
==========================================
  Files          38       38              
  Lines        4358     4364       +6     
==========================================
+ Hits         3893     3899       +6     
  Misses        465      465              
Flag Coverage Δ
unittests 89.34% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors Copilot-backed LLM components to use a single centralized DEFAULT_MODEL constant (reducing drift risk) and hardens benchmark exception handling so failures are observable while keeping benchmark runs resilient.

Changes:

  • Re-export DEFAULT_MODEL from featcopilot.utils and replace duplicated "gpt-5.2" literals across Copilot-backed LLM modules, sklearn adapter fallback paths, and benchmark runners.
  • Improve benchmarks/simple_models/run_simple_models_benchmark.py error handling by distinguishing expected vs unexpected per-fold FE failures, logging stack traces for unexpected errors, and recording per-fold failure metadata in results.
  • Add regression tests covering default-model centralization and benchmark exception-handling behavior/logging.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
featcopilot/utils/__init__.py Re-exports DEFAULT_MODEL to make it available from the public featcopilot.utils namespace.
featcopilot/llm/copilot_client.py Switches Copilot client config/init defaults to centralized DEFAULT_MODEL.
featcopilot/llm/code_generator.py Uses DEFAULT_MODEL as the default model for the code generator.
featcopilot/llm/explainer.py Uses DEFAULT_MODEL as the default model for the explainer.
featcopilot/llm/semantic_engine.py Updates Pydantic config + engine init default to DEFAULT_MODEL.
featcopilot/llm/transform_rule_generator.py Uses DEFAULT_MODEL as the default model and updates docstring accordingly.
featcopilot/transformers/sklearn_compat.py Replaces the sklearn adapter LLM fallback model literal with DEFAULT_MODEL.
benchmarks/simple_models/run_simple_models_benchmark.py Centralizes benchmark LLM config model and adds structured/logged exception handling with failure tracking.
benchmarks/automl/run_automl_benchmark.py Uses DEFAULT_MODEL in benchmark LLM config builder.
benchmarks/compare_tools/run_fe_tools_comparison.py Uses DEFAULT_MODEL in benchmark LLM config builder.
tests/test_utils.py Adds tests pinning the DEFAULT_MODEL re-export and ensuring consistent LLM-module defaults.
tests/test_benchmark_encoding.py Adds tests pinning narrowed exception swallowing, expected-failure types, centralized model usage, and module logger presence.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread benchmarks/simple_models/run_simple_models_benchmark.py Outdated
The previous comment claimed the stdlib `logging` choice was about
'staying usable before featcopilot is installed' — but this PR added a
direct `from featcopilot.utils.models import DEFAULT_MODEL` right
above the comment, which contradicts that claim.

The actual motivation is that `featcopilot.utils.logger` sets
`propagate=False` on the `featcopilot.*` logger tree (see
`featcopilot/utils/logger.py:20`), which would block benchmark
output from reaching root-logger handlers configured by downstream
consumers (CI runners, log aggregators, `pytest --log-cli`).

Updated the comment to reflect that real motivation; no behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@thinkall
Copy link
Copy Markdown
Owner Author

thinkall commented May 6, 2026

@copilot @codex please review the whole PR (not just the latest commit).

Round-1 fix (e14a078) addressed the Copilot reviewer thread on
benchmarks/simple_models/run_simple_models_benchmark.py: the
comment justifying logging.getLogger(__name__) previously claimed
this kept the module usable before featcopilot was installed, which
contradicted the new from featcopilot.utils.models import DEFAULT_MODEL line right above it. The comment is now grounded in
the real motivation: featcopilot/utils/logger.py:20 sets
propagate=False, which would block benchmark log output from
reaching root-logger handlers configured by downstream consumers
(CI / log aggregators / pytest --log-cli).

All checks green; please look at the full PR scope:

  • centralization of DEFAULT_MODEL across 5 LLM modules + sklearn
    adapter + 3 benchmark runners
  • per-fold FE exception narrowing (_EXPECTED_FE_FAILURES vs
    logger.exception for unexpected bugs)
  • fe_failed_folds / n_fe_failed_folds results-dict additions
  • Wilcoxon failure logging
  • 7 new regression tests

Copy link
Copy Markdown
Contributor

Copilot AI commented May 6, 2026

@copilot @codex please review the whole PR (not just the latest commit).

Round-1 fix (e14a078) addressed the Copilot reviewer thread on...

Full PR review complete — here's my assessment of all the changes:

DEFAULT_MODEL centralization (all Copilot-backed LLM modules + benchmarks)

  • featcopilot/utils/__init__.pyDEFAULT_MODEL correctly added to both the import and __all__, so from featcopilot.utils import DEFAULT_MODEL works cleanly.
  • copilot_client.py, code_generator.py, explainer.py, transform_rule_generator.py, semantic_engine.py — each imports DEFAULT_MODEL from featcopilot.utils.models and uses it as the parameter default. Docstrings updated to reflect the dynamic default.
  • sklearn_compat.pyAutoFeatureEngineer._create_engine("llm") fallback uses DEFAULT_MODEL.
  • All three benchmark runners — get_featcopilot_engines and llm_config builder use DEFAULT_MODEL. openai_client.py and litellm_client.py intentionally keep "gpt-4o" (correct, those are a different backend).

Benchmark exception hardening (run_simple_models_benchmark.py)

  • _EXPECTED_FE_FAILURES tuple is well-scoped — covers ValueError, KeyError, TypeError, RuntimeError, MemoryError, LinAlgError (all sklearn/pandas/featcopilot validation paths) and deliberately excludes AttributeError, NameError, ImportError (genuine bugs that should surface).
  • Per-fold split: _EXPECTED_FE_FAILURES → logger.warning + baseline fallback; Exception → logger.exception + baseline fallback — correct pattern. Both branches record into fe_failed_folds with "expected": True/False so downstream consumers can distinguish a "fair" comparison from a fallback-padded one.
  • n_fe_failed_folds and fe_failed_folds added to the results dict — useful for report consumers.
  • _resolve_source narrowed to (KeyError, ValueError, TypeError) — matches the documented exception surface of is_real_world.
  • Wilcoxon ValueError now logs at WARNING with the dataset name and count — still falls back to p_value=1.0 (correct), but no longer looks like a genuine null result.
  • Top-level except Exception uses logger.exception — no more print + traceback.print_exc.
  • Stdlib logging.getLogger(__name__) usage with the updated comment (e14a078) is correct: the real motivation is propagate=False on the featcopilot.* tree blocking CI/log-aggregator output.

Tests

  • test_default_model_exposed_from_utils_package — pins the re-export.
  • test_llm_modules_share_default_model — covers all 6 Copilot-backed modules/configs end-to-end.
  • test_auto_feature_engineer_llm_default_model — verifies the sklearn_compat fallback path.
  • test_resolve_source_only_swallows_expected_exceptions — proves AttributeError now propagates.
  • test_expected_fe_failures_includes_recoverable_types — pins the _EXPECTED_FE_FAILURES contract.
  • test_benchmark_module_uses_centralized_default_model — end-to-end check via get_featcopilot_engines.
  • test_benchmark_logger_is_configured — pins the module-owned logger attribute.

All CI checks are green. No issues found — the changes are clean, well-scoped, and the tests provide solid regression coverage.

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Chef's kiss.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@jianglibigdata jianglibigdata merged commit bf11981 into main May 6, 2026
8 checks passed
@jianglibigdata jianglibigdata deleted the chore/centralize-default-model-and-benchmark-hardening branch May 6, 2026 12:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants