
Commit 9241b8a

Merge pull request #5 from OpenMLRL/new
Feature: reconstruct and design new reward system
2 parents 52fbe8b + 710fedc commit 9241b8a

20 files changed: 2326 additions, 1092 deletions

README.md

Lines changed: 11 additions & 14 deletions
@@ -28,8 +28,9 @@ conda install -c conda-forge comlrl
   `dataset.train_split` and `dataset.eval_split` (e.g., `test[:50]` and `test[50:]`).
 - **Subsetting**: if a split name is missing (e.g., ClassEval only has `test`),
   the loader falls back to the first available split before slicing.
-- **Prompting**: prompts include the sanitized class skeleton, explicit method names per
-  agent, and any collaboration instructions.
+- **Prompting**: prompts include the sanitized class skeleton plus per-agent method
+  assignments. The default strategy assigns 1-parameter methods to agent 0 and all other
+  methods to agent 1.
 - **Testing**: reward code merges agent completions back into the skeleton and runs the
   provided unit tests inside a temporary directory to isolate state.

@@ -41,25 +42,21 @@ Key sections in `configs/magrpo_classeval_config.yaml`:
   kwargs, and device mapping.
 - `dataset`: dataset name and split strings (`train_split`, `eval_split`) for
   ClassEval sub-slices or local mirrors.
-- `external`: determines the feedback mode. `token_report` summarizes syntax/tests at each
-  turn; other modes replicate the options documented in the code-generation README
-  (`plain`, `level_feedback`, `group_feedback`, `personal_feedback`, `personal_detailed_feedback`,
-  `passed`, `level_passed`).
+- `external`: feedback configuration (use `code_feedback` for syntax/test diagnostics).
 - `magrpo`: forwarded to `comlrl.trainers.magrpo.MAGRPOTrainer`. Includes collaboration
-  (`num_agents`, TAKE_JOB self-select), sampling settings (`num_generations`, `num_turns`,
+  (`num_agents`, param-count assignment), sampling settings (`num_generations`, `num_turns`,
   temperature/top_p), rollout buffering (`rollout_buffer_size`), optimization
   hyperparameters, and IO controls.
-- `output`: persistence knobs (save final model, keep tmp dirs); environment variables such
-  as `CLASSEVAL_TMP_BASE` are derived from this section to colocate temp files per job.
+- `reward_processor`: optional post-processing for rewards (scale, shift).
+- `output`: persistence knobs (save final model, output paths, verbose debug prints).

 ## Rewards, Logging, and Evaluation

 - `rewards/CE_reward.py` computes structured rewards:
-  - `lv1`: coverage of unique methods completed.
-  - `lv2`: penalizes under/over-allocation of total method picks.
-  - `lv3`: balance term encouraging an even workload across agents.
-  - `lv4`/`lv5`: syntax + unit-test bonuses (reported for analysis; syntax/test failures
-    short-circuit the run where applicable).
+  - `lv1`: syntax score proportional to valid method outputs (range [0, 2]).
+  - `lv2`: unit-test bonus based on pass rate (passed/total), scaled to [0, 4].
+  - `lv3`: overlap penalty normalized by total methods (range [-1, 0]).
+  - reward shift: optional post-processing shift via `reward_processor.shift`.
 - Tests execute inside per-sample temporary directories to avoid polluted state and are
   automatically truncated on timeout.
 - Loggers are inherited from CoMLRL. Enable Weights & Biases by filling `wandb.entity`
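
The `lv1`–`lv3` scheme in the README diff above can be sketched in Python. This is a minimal illustration, not the actual `rewards/CE_reward.py`: the helper name and its count-based arguments are hypothetical, but the ranges follow the README text ([0, 2] syntax, [0, 4] tests, [-1, 0] overlap), and `shift` mirrors `reward_processor.shift`.

```python
# Minimal sketch of the three-level reward described in the README diff.
# Hypothetical helper; the real rewards/CE_reward.py computes these
# quantities from agent completions and test runs.

def classeval_reward(n_valid: int, n_methods: int,
                     n_passed: int, n_tests: int,
                     n_overlap: int, shift: float = 0.0) -> float:
    """Combine syntax, unit-test, and overlap terms into one scalar."""
    # lv1: syntax score proportional to valid method outputs, range [0, 2]
    lv1 = 2.0 * n_valid / n_methods if n_methods else 0.0
    # lv2: unit-test bonus based on pass rate (passed/total), scaled to [0, 4]
    lv2 = 4.0 * n_passed / n_tests if n_tests else 0.0
    # lv3: overlap penalty normalized by total methods, range [-1, 0]
    lv3 = -n_overlap / n_methods if n_methods else 0.0
    # optional post-processing shift (reward_processor.shift)
    return lv1 + lv2 + lv3 + shift
```

Under this reading the raw reward peaks at 6.0 (all methods valid, all tests passing, no overlap), which may explain why the configs in this commit pair it with `shift: -6.0` to center the maximum near zero.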

configs/iac_classeval_config.yaml

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
+model:
+  name: Qwen/Qwen3-4B-Instruct-2507
+  type: qwen
+  temperature: 0.6
+  top_p: 0.6
+  max_length: 2048
+  tokenizer_kwargs: {}
+  model_kwargs:
+    trust_remote_code: true
+    device_map: auto
+    torch_dtype: bfloat16
+
+dataset:
+  name: FudanSELab/ClassEval
+  type: classeval
+  train_split: test[:66]
+  eval_split: test[66:82]
+
+output:
+  base_dir: output
+  save_final_model: false
+  save_path: output/final_model
+  verbose: false
+
+external:
+  mode: code_feedback
+  original_prompt: true
+  previous_response: true
+
+patches:
+  generation_memory: true
+  single_agent_returns: true
+
+iac:
+  num_turns: 2
+  num_train_epochs: 40
+  per_device_train_batch_size: 1
+  rollout_buffer_size: 2
+  actor_learning_rate: 5e-6
+  critic_learning_rate: 5e-6
+  value_loss_coef: 0.6
+  value_clip_range: 0.2
+  max_new_tokens: 600
+  temperature: 0.6
+  top_p: 0.6
+  top_k: null
+  num_agents: 2
+  num_return_sequences: 1
+  use_separate_critic: true
+  critic_model: null
+  discount: 0.9
+  early_termination_threshold: -0.2
+  eval_interval: 20
+  eval_num_samples: 4
+  logging_steps: 1
+
+reward_processor:
+  enabled: true
+  scale_factor: 1.0
+  shift: -6.0
+
+wandb:
+  project: classeval_dev
+  entity: null
+  name: codecompletion_classeval_iac
+  dir: output
+  tags: ["iac", "classeval", "code-completion", "turns_2"]
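
The `reward_processor` block in this config appears to apply a linear transform to the raw reward. A hedged sketch of what `scale_factor` and `shift` would do (the function name is hypothetical, and the assumed order — scale, then shift — is an assumption, not confirmed by the diff):

```python
def process_reward(raw: float, scale_factor: float = 1.0,
                   shift: float = 0.0, enabled: bool = True) -> float:
    """Hypothetical reward_processor: linear post-processing of a raw
    reward. Assumed order: scale first, then shift."""
    if not enabled:
        return raw
    return raw * scale_factor + shift
```

With the values above (`scale_factor: 1.0`, `shift: -6.0`), a raw reward of 6.0 maps to 0.0.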

configs/maac_classeval_config.yaml

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
+model:
+  name: Qwen/Qwen3-4B-Instruct-2507
+  type: qwen
+  temperature: 0.6
+  top_p: 0.6
+  max_length: 2048
+  tokenizer_kwargs: {}
+  model_kwargs:
+    trust_remote_code: true
+    device_map: auto
+    torch_dtype: bfloat16
+
+dataset:
+  name: FudanSELab/ClassEval
+  type: classeval
+  train_split: test[:66]
+  eval_split: test[66:82]
+
+output:
+  base_dir: output
+  save_final_model: false
+  save_path: output/final_model
+  verbose: false
+
+external:
+  mode: code_feedback
+  original_prompt: true
+  previous_response: true
+
+patches:
+  generation_memory: true
+  single_agent_returns: true
+
+maac:
+  num_turns: 2
+  critic_type: v
+  num_train_epochs: 40
+  per_device_train_batch_size: 1
+  rollout_buffer_size: 2
+  actor_learning_rate: 5e-6
+  critic_learning_rate: 5e-6
+  value_loss_coef: 0.6
+  max_new_tokens: 600
+  temperature: 0.6
+  top_p: 0.6
+  top_k: null
+  num_agents: 2
+  num_return_sequences: 1
+  critic_model: null
+  discount: 0.9
+  early_termination_threshold: -0.2
+  eval_interval: 20
+  eval_num_samples: 2
+  logging_steps: 1
+
+reward_processor:
+  enabled: true
+  scale_factor: 1.0
+  shift: -6.0
+
+wandb:
+  project: classeval_dev
+  entity: null
+  name: codecompletion_classeval_maac
+  dir: output
+  tags: ["maac", "classeval", "code-completion", "turns_2"]
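
Both configs set `num_agents: 2`, which pairs with the default assignment strategy from the README diff: 1-parameter methods go to agent 0, all other methods to agent 1. A hypothetical sketch using `inspect` (assuming the parameter count includes `self`; `assign_methods` and `Example` are illustrative, not CoMLRL API):

```python
import inspect

def assign_methods(cls, num_agents: int = 2) -> dict[int, list[str]]:
    """Sketch of the default param-count strategy: methods whose signature
    has exactly one parameter (just `self`, by assumption) go to agent 0,
    everything else to agent 1."""
    assignment = {i: [] for i in range(num_agents)}
    for name, fn in inspect.getmembers(cls, inspect.isfunction):
        n_params = len(inspect.signature(fn).parameters)
        agent = 0 if n_params == 1 else 1
        assignment[agent].append(name)
    return assignment

class Example:
    def reset(self):       # 1 parameter (self) -> agent 0
        pass
    def add(self, x, y):   # 3 parameters -> agent 1
        return x + y
```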

configs/magrpo_classeval_config.yaml

Lines changed: 17 additions & 11 deletions
@@ -1,7 +1,7 @@
 model:
-  name: Qwen/Qwen2.5-Coder-3B-Instruct
+  name: Qwen/Qwen3-4B-Instruct-2507
   type: qwen
-  temperature: 0.25
+  temperature: 0.6
   top_p: 0.6
   max_length: 2048
   tokenizer_kwargs: {}

@@ -21,37 +21,43 @@ output:
   save_final_model: false
   save_path: output/final_model
   verbose: false
-  keep_tmp: true
-  tmp_base_dir: output/tmp

 external:
   mode: code_feedback
-  sandbox_slice: null
   original_prompt: true
   previous_response: true

+patches:
+  generation_memory: true
+  single_agent_returns: true
+
 magrpo:
   num_turns: 2
-  num_train_epochs: 8
+  num_train_epochs: 13
   per_device_train_batch_size: 1
   rollout_buffer_size: 1
   learning_rate: 1e-5
-  eval_interval: 4
+  eval_interval: 10
   eval_num_samples: 4
-  num_generations: 3
+  num_generations: 2
   max_new_tokens: 600
-  temperature: 0.25
+  temperature: 0.6
   top_p: 0.6
   top_k: null
   joint_mode: aligned
   num_agents: 2
   discount: 0.9
-  termination_threshold: 7.8
-  logging_steps: 50
+  termination_threshold: -0.2
+  logging_steps: 1
   save_steps: 200
   normalize_advantage: false
   epsilon_clip: null

+reward_processor:
+  enabled: true
+  scale_factor: 1.0
+  shift: -6.0
+
 wandb:
   project: classeval_dev
   entity: null
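
Across the three trainer configs, `discount: 0.9` and a termination threshold of -0.2 (`termination_threshold` / `early_termination_threshold`) shape multi-turn credit assignment. A hypothetical sketch of a discounted return with early termination; this is one plausible reading of the knobs, and the actual trainer logic may differ:

```python
def discounted_return(turn_rewards, discount=0.9,
                      termination_threshold=-0.2):
    """Sum per-turn rewards with geometric discounting, stopping early
    once a turn's reward falls below the threshold (assumption: the
    threshold cuts off subsequent turns)."""
    total, factor = 0.0, 1.0
    for r in turn_rewards:
        total += factor * r
        if r < termination_threshold:
            break  # early termination: later turns contribute nothing
        factor *= discount
    return total
```

With the shifted reward centered near zero (see `reward_processor.shift: -6.0`), a turn scoring below -0.2 would end the rollout under this reading.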
