fix: CI compatibility fixes (HF_TOKEN, arena-hard migration, datasets 4.8.5) by yoavkatz · Pull Request #1966 · IBM/unitxt

yoavkatz · 2026-05-14T11:23:14Z

Summary

Fixes huggingface-cli: command not found error in the catalog_preparation workflow by replacing the explicit huggingface-cli login step with the HF_TOKEN environment variable
Migrates all arena-hard cards from the defunct lmsys/arena-hard-browser HF space to its replacement lmarena-ai/arena-hard-viewer (revision 56c7614):
- both_games_gpt4_judge
- first_game_only_gpt4_judge
- both_games_mean_judgment_gpt4_judge
Updates processing steps in prepare/cards/arena_hard/common.py to handle the new data format (prompt instead of turns/0/content, messages/1/content/answer instead of choices/0/turns/0/content, uid instead of question_id)
Fixes WeightedWinRateCorrelation metric bug where pearsonr/spearmanr failed with newer scipy/numpy due to object dtype columns

Test plan

Verify arena-hard card loads successfully from new HF space (12,990 examples)
Verify the catalog_preparation workflow passes on this PR
Confirm HuggingFace-authenticated dataset access works correctly

🤖 Generated with Claude Code

…reparation CI The huggingface-cli command was not found on PATH in CI, causing the login step to fail. Using the HF_TOKEN environment variable is the recommended approach for CI and avoids PATH issues entirely. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Yoav Katz <katz@il.ibm.com>

The newer datasets library changed as_dataset() to accept fewer positional arguments. Pass all arguments as keyword arguments for forward compatibility. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Yoav Katz <katz@il.ibm.com>

…WeightedWinRateCorrelation metric The old lmsys/arena-hard-browser HF space is no longer available. This migrates to the replacement space lmarena-ai/arena-hard-viewer with adapted processing steps for its different data format (flat prompt field, messages-based answers, uid instead of question_id). Also fixes a bug in WeightedWinRateCorrelation where pd.DataFrame columns initialized as object dtype caused scipy pearsonr to fail with newer numpy/scipy versions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Yoav Katz <katz@il.ibm.com>

…odels Models like bloom-560M default to float16, causing numerical overflow in attention computations with padded inputs. Forcing float32 ensures stable perplexity scores regardless of model's default dtype. Signed-off-by: Yoav Katz <yoavkatz@il.ibm.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Yoav Katz <katz@il.ibm.com>

…s>=4.8.5 The `datasets` library removed `run_post_process` and `verification_mode` parameters from `DatasetBuilder.as_dataset()` in version 4.8.5. These parameters were already non-functional in 4.8.4 (the `_post_process` method and verification logic had been removed from the implementation), but 4.8.5 cleaned up the signature to match, causing a TypeError. - Remove `run_post_process=False` and `verification_mode="no_checks"` from the call site in api.py - Remove both parameters from the Dataset.as_dataset() override signature and the super() call in dataset.py No behavioral change: post-processing and verification were already no-ops in recent datasets versions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Yoav Katz <katz@il.ibm.com>

…port bug Python 3.10's assertWarns() iterates sys.modules and triggers transformers' lazy loader to import aria.image_processing_aria which requires torchvision. Using warnings.catch_warnings(record=True) avoids this module iteration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Yoav Katz <katz@il.ibm.com>

…rror The tokenizers Rust backend (>=0.22) now validates integer sizes, causing DeBERTa's absurd model_max_length (1e30) to overflow. Use BERTScorer directly and cap model_max_length to the model's max_position_embeddings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Yoav Katz <katz@il.ibm.com>

Update first_game_only and both_games_mean_judgment cards to use the new HF space, matching the migration done for both_games_gpt4_judge. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Yoav Katz <katz@il.ibm.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Yoav Katz <katz@il.ibm.com>

BertScore scorer.score() hangs silently in CI. Adding verbose=True will show per-batch progress to identify where it stalls. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Yoav Katz <katz@il.ibm.com>

Commenting out test_card calls that hang in CI to determine if the issue is specific to these cards or affects all BertScore usage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Yoav Katz <katz@il.ibm.com>

yoavkatz force-pushed the fix/catalog-prep-hf-login branch from 20a824b to ba9460d Compare May 14, 2026 11:38

yoavkatz and others added 2 commits May 14, 2026 14:50

yoavkatz force-pushed the fix/catalog-prep-hf-login branch from ba9460d to 7c881da Compare May 14, 2026 12:02

yoavkatz changed the title ~~fix: Replace huggingface-cli login with HF_TOKEN env var in CI~~ fix: Replace huggingface-cli login with HF_TOKEN env var and migrate arena-hard to new HF space May 18, 2026

yoavkatz changed the title ~~fix: Replace huggingface-cli login with HF_TOKEN env var and migrate arena-hard to new HF space~~ fix: CI compatibility fixes (HF_TOKEN, arena-hard migration, datasets 4.8.5) May 18, 2026

yoavkatz force-pushed the fix/catalog-prep-hf-login branch from 700318a to 9a84c0b Compare May 18, 2026 09:20

yoavkatz and others added 6 commits May 18, 2026 12:38

fix: Log each preparation file at CRITICAL level for CI visibility

292ab92

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Yoav Katz <katz@il.ibm.com>

fix: Enable verbose logging in BertScore to debug CI hang

43f0cea

BertScore scorer.score() hangs silently in CI. Adding verbose=True will show per-batch progress to identify where it stalls. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Yoav Katz <katz@il.ibm.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: CI compatibility fixes (HF_TOKEN, arena-hard migration, datasets 4.8.5)#1966

fix: CI compatibility fixes (HF_TOKEN, arena-hard migration, datasets 4.8.5)#1966
yoavkatz wants to merge 11 commits into
mainfrom
fix/catalog-prep-hf-login

yoavkatz commented May 14, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

yoavkatz commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

yoavkatz commented May 14, 2026 •

edited

Loading