fix(bertscore): cap model_max_length to prevent Rust tokenizer OverflowError #756
Open
xodn348 wants to merge 1 commit into huggingface:main from
Conversation
…d OverflowError Models such as microsoft/deberta-xlarge-mnli omit model_max_length from their tokenizer config. transformers fills the gap with a huge sentinel (~1e30), which bert_score then passes to the Rust tokenizers backend via enable_truncation(), causing OverflowError: int too big to convert. Add an explicit cap in BERTScore._compute(): if the caller supplies max_length that value is applied directly; otherwise any sentinel larger than sys.maxsize is clamped to 512. Both paths are covered by new unit tests that mock the scorer so no model download is required. Fixes huggingface#739
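The failure mode can be illustrated without any model download. The sketch below is a dependency-free analogy: the real overflow happens inside the Rust tokenizers backend when enable_truncation() receives the sentinel, but squeezing the value into a fixed-width integer reproduces the same exception in pure Python.

```python
import sys

# transformers fills a missing model_max_length with a huge sentinel (~1e30).
sentinel = int(1e30)
print(sentinel > sys.maxsize)  # True on 64-bit CPython

# Forcing the sentinel into a fixed-width slot, as the Rust backend's
# usize/u32 must, is what blows up:
try:
    sentinel.to_bytes(8, "little")  # a 64-bit slot, analogous to usize
except OverflowError as err:
    print(err)  # int too big to convert
```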
Summary
BERTScore crashes with
OverflowError: int too big to convert when the model being evaluated (e.g. microsoft/deberta-xlarge-mnli) does not declare model_max_length in its tokenizer config. In that case transformers fills the missing field with a huge sentinel value (~1e30). When bert_score later passes this value to the Rust tokenizers backend via enable_truncation(), the backend overflows because usize/u32 cannot hold an integer of that magnitude. From the user's perspective the failure is cryptic: the metric simply crashes without a clear explanation.

The fix adds a guard inside BERTScore._compute(): right after the BERTScorer is created (or retrieved from cache), the metric inspects the tokenizer's model_max_length. If the caller has explicitly set max_length, that value is applied directly. If not, and the tokenizer carries a sentinel larger than sys.maxsize, it is clamped to 512, a value that is safe for all standard BERT-family models. This keeps the existing behaviour unchanged for every model that properly declares its max length, and silently repairs the broken case without requiring an upstream bert-score change.

A new max_length parameter is also exposed in _compute() so that advanced users who need explicit control over the truncation length (e.g. long-document models) can set it directly, independent of whatever the tokenizer config says.

Issue
Fixes #739
Local verification
Risk
The guard only fires when
model_max_length > sys.maxsize (i.e. the sentinel case) or when the user explicitly passes max_length. All models that define a real model_max_length in their config continue to use their declared value unchanged. The clamped default of 512 is conservative for BERT-family models (whose maximum is typically 512) and matches the bert-score project's own hard-coded defaults in its baseline files. Callers who need a different truncation length for a specific model can override it with the new max_length parameter.
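The guard's decision logic can be sketched as follows. This is an illustrative re-implementation under the behaviour described in this PR, not the exact code; safe_max_length is a hypothetical helper name.

```python
import sys
from typing import Optional


def safe_max_length(tokenizer_max_length: int,
                    user_max_length: Optional[int] = None) -> int:
    """Pick a truncation length the Rust tokenizers backend can accept."""
    # An explicit caller-supplied max_length always wins.
    if user_max_length is not None:
        return user_max_length
    # A sentinel above sys.maxsize would overflow the backend's usize,
    # so clamp it to the BERT-family default of 512.
    if tokenizer_max_length > sys.maxsize:
        return 512
    # A properly declared value is passed through unchanged.
    return tokenizer_max_length


print(safe_max_length(int(1e30)))        # sentinel case -> 512
print(safe_max_length(384))              # declared value -> 384
print(safe_max_length(int(1e30), 1024))  # explicit override -> 1024
```

Note that the override path applies even when the declared value is sane, which is what lets long-document models opt out of the 512 default.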