fix: support continuous reward scores (int truncation + falsy float) #20
Open
1s1x wants to merge 3 commits into
int() truncates any float in [0,1) to 0. Replace with float(). Also fix falsy float check in failure detection. Backward compatible with binary hard=0/1.
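A minimal sketch of the truncation effect, for illustration only. The sample results and the `bucket_mean` helper are hypothetical and are not the actual `_compute_task_type_buckets` code; only the `hard` key comes from the PR text.

```python
# Hypothetical per-task results; only the "hard" key is taken from the PR text.
results = [{"hard": 0.348}, {"hard": 0.9}, {"hard": 1.0}]


def bucket_mean(results, cast):
    """Average the "hard" score after applying `cast` (int or float)."""
    scores = [cast(r.get("hard", 0)) for r in results]
    return sum(scores) / len(scores)


# int() drops the fractional part, so 0.348 and 0.9 both become 0.
truncated = bucket_mean(results, int)
# float() keeps each continuous score unchanged.
preserved = bucket_mean(results, float)

print(f"int():   {truncated:.4f}")
print(f"float(): {preserved:.4f}")
```

With `int()`, only the score of exactly 1.0 contributes to the average, so the aggregate understates every partial score.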
not r.get("hard") treats non-zero floats as success.
Add explicit float threshold check (< 1e-9).
Backward compatible with binary hard=0/1.
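The difference between the truthiness check and an explicit threshold can be sketched as follows. The exact comparison used in the patch is not shown in the commit message, so `is_failure_threshold` is an assumed form (score below `1e-9` counts as failure), and both helper names are hypothetical.

```python
EPS = 1e-9  # threshold quoted in the commit message


def is_failure_truthy(result):
    """Truthiness-based check: any non-zero score counts as success."""
    return not result.get("hard")


def is_failure_threshold(result, eps=EPS):
    """Assumed explicit check: a missing or near-zero score counts as failure."""
    return float(result.get("hard") or 0.0) < eps


samples = [{}, {"hard": 0}, {"hard": 1e-12}, {"hard": 0.348}, {"hard": 1}]
for r in samples:
    print(r, is_failure_truthy(r), is_failure_threshold(r))
```

Under this assumed comparison the two checks differ only for tiny non-zero scores such as `1e-12`. A mid-range score such as 0.348 is still not a failure here, so whether such scores reach the reflect phase depends on the comparison the patch actually uses.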
@1s1x please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
int() truncates smoothed composite scores (0.0-1.0) to 0, making all continuous reward values appear as failures. This broke SkillOpt training pipelines using SmoothedCompositeReward.
Problem
SkillOpt assumes hard scores are binary (0 or 1). When using continuous reward functions (e.g., composite scores in [0.0, 1.0]), two bugs break the training loop:

Bug 1: int() truncates continuous scores to 0
File: skillopt/engine/trainer.py:377
int() truncates any float in [0, 1) to 0, so _compute_task_type_buckets reports hard=0 for every score below 1 and aggregate metrics become invisible to the training loop. Epoch-level logging and the reflect gradient signal are broken.

Bug 2: not r.get("hard") treats non-zero floats as success
Files: skillopt/engine/trainer.py:396,1325 and skillopt/gradient/reflect.py:493
Python truthiness: not 0.5 is False, so continuous scores like 0.3480 are treated as success. This prevents the reflect phase from analyzing weak-but-non-zero results for improvement.

Validation
Tested with a quant factor optimization adapter using SmoothedCompositeReward (continuous scores in [0, 1]).

Backward Compatibility
Both fixes are fully backward-compatible with existing binary hard=0/1 scoring:
- float(0) = 0.0 and float(1) = 1.0, identical to int() for binary scores
- not 0 = True and not 1 = False, so the < 1e-9 threshold has no effect on binary scores

Changes
- skillopt/engine/trainer.py: Fix int() -> float() in _compute_task_type_buckets, fix falsy check in _extract_failure_patterns and step buffer logging
- skillopt/gradient/reflect.py: Fix falsy check in run_minibatch_reflect failure filtering
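The backward-compatibility claim for binary scores can be checked directly. This is a small sketch: the `score < EPS` comparison is an assumed form of the threshold check, and the variable names are hypothetical.

```python
EPS = 1e-9  # threshold quoted in the PR description

# Binary scores as they might appear: ints, floats, or bools.
binary_scores = [0, 1, 0.0, 1.0, False, True]

# int() and float() give numerically equal results on binary inputs.
cast_agrees = all(int(s) == float(s) for s in binary_scores)

# The truthiness check and the assumed threshold check classify
# every binary input the same way.
failure_agrees = all((not s) == (float(s) < EPS) for s in binary_scores)

print(cast_agrees, failure_agrees)
```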