@@ -28,8 +28,9 @@ conda install -c conda-forge comlrl
   `dataset.train_split` and `dataset.eval_split` (e.g., `test[:50]` and `test[50:]`).
 - **Subsetting**: if a split name is missing (e.g., ClassEval only has `test`),
   the loader falls back to the first available split before slicing.
-- **Prompting**: prompts include the sanitized class skeleton, explicit method names per
-  agent, and any collaboration instructions.
+- **Prompting**: prompts include the sanitized class skeleton plus per-agent method
+  assignments. The default strategy assigns 1-parameter methods to agent 0 and all other
+  methods to agent 1.
 - **Testing**: reward code merges agent completions back into the skeleton and runs the
   provided unit tests inside a temporary directory to isolate state.
 
@@ -41,25 +42,21 @@ Key sections in `configs/magrpo_classeval_config.yaml`:
   kwargs, and device mapping.
 - `dataset`: dataset name and split strings (`train_split`, `eval_split`) for
   ClassEval sub-slices or local mirrors.
-- `external`: determines the feedback mode. `token_report` summarizes syntax/tests at each
-  turn; other modes replicate the options documented in the code-generation README
-  (`plain`, `level_feedback`, `group_feedback`, `personal_feedback`, `personal_detailed_feedback`,
-  `passed`, `level_passed`).
+- `external`: feedback configuration (use `code_feedback` for syntax/test diagnostics).
 - `magrpo`: forwarded to `comlrl.trainers.magrpo.MAGRPOTrainer`. Includes collaboration
-  (`num_agents`, TAKE_JOB self-select), sampling settings (`num_generations`, `num_turns`,
+  (`num_agents`, param-count assignment), sampling settings (`num_generations`, `num_turns`,
   temperature/top_p), rollout buffering (`rollout_buffer_size`), optimization
   hyperparameters, and IO controls.
-- `output`: persistence knobs (save final model, keep tmp dirs); environment variables such
-  as `CLASSEVAL_TMP_BASE` are derived from this section to colocate temp files per job.
+- `reward_processor`: optional post-processing for rewards (scale, shift).
+- `output`: persistence knobs (save final model, output paths, verbose debug prints).
 
 ## Rewards, Logging, and Evaluation
 
 - `rewards/CE_reward.py` computes structured rewards:
-  - `lv1`: coverage of unique methods completed.
-  - `lv2`: penalizes under/over-allocation of total method picks.
-  - `lv3`: balance term encouraging an even workload across agents.
-  - `lv4`/`lv5`: syntax + unit-test bonuses (reported for analysis; syntax/test failures
-    short-circuit the run where applicable).
+  - `lv1`: syntax score proportional to valid method outputs (range [0, 2]).
+  - `lv2`: unit-test bonus based on pass rate (passed/total), scaled to [0, 4].
+  - `lv3`: overlap penalty normalized by total methods (range [-1, 0]).
+  - reward shift: optional post-processing shift via `reward_processor.shift`.
 - Tests execute inside per-sample temporary directories to avoid polluted state and are
   automatically truncated on timeout.
 - Loggers are inherited from CoMLRL. Enable Weights & Biases by filling `wandb.entity`
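
For orientation, the config sections named in the second hunk might look like the sketch below. The nesting follows conventional YAML config layout; every value, and any key not named in the bullets above, is an illustrative assumption rather than the actual `configs/magrpo_classeval_config.yaml` contents.

```yaml
# Hypothetical sketch of configs/magrpo_classeval_config.yaml; values are illustrative.
dataset:
  train_split: "test[:50]"
  eval_split: "test[50:]"
external: code_feedback      # syntax/test diagnostics mode
magrpo:
  num_agents: 2              # param-count assignment splits methods across agents
  num_generations: 4         # illustrative value
  num_turns: 2               # illustrative value
  temperature: 0.8           # illustrative value
  rollout_buffer_size: 8     # illustrative value
reward_processor:
  shift: -1.0                # illustrative value
output:
  save_final_model: true     # key name assumed
```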
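
The `lv1`-`lv3` terms and the optional shift described in the last hunk can be sketched as a single scalar function. The function name, argument names, and the flat-scalar interface are illustrative assumptions, not the actual `rewards/CE_reward.py` signature; only the ranges and normalizations come from the bullets above.

```python
def classeval_reward(valid_methods: int, total_methods: int,
                     passed_tests: int, total_tests: int,
                     overlapping_methods: int, shift: float = 0.0) -> float:
    """Sketch of the lv1-lv3 reward terms (illustrative interface)."""
    # lv1: syntax score proportional to valid method outputs, in [0, 2]
    lv1 = 2.0 * valid_methods / total_methods if total_methods else 0.0
    # lv2: unit-test bonus from pass rate (passed/total), scaled to [0, 4]
    lv2 = 4.0 * passed_tests / total_tests if total_tests else 0.0
    # lv3: overlap penalty normalized by total methods, in [-1, 0]
    lv3 = -overlapping_methods / total_methods if total_methods else 0.0
    # optional post-processing shift (reward_processor.shift) applied last
    return lv1 + lv2 + lv3 + shift
```

With all methods valid, all tests passing, and no overlap, the sum reaches the maximum of 6.0 before any shift.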