Skip to content

Code review skill produces wildly different results between Claude and Codex #835

@alessandrotesoro

Description

@alessandrotesoro

Hi there,

First of all, thank you for creating such a great plugin. I've been using it for some time now and it's been great.

I've primarily been a Claude user, but lately I've been testing Codex and I've noticed a big difference in how the /ce-code-review skill behaves between the two.

When used against the same task/commit diff, Codex tends to return single-digit findings, while Claude usually produces double-digit findings. The difference between the two is pretty significant.

When I asked Codex why there was such a difference between the results, it said:

I did not spawn the full applicable reviewer set for the final diff.
I did not create/read a fresh run artifact and then synthesize from every reviewer output.
I over-applied the “validate and drop weak findings” mindset before showing you the raw-to-final mapping.
Claude Code has a more direct implementation path for this CE skill; in Codex I had to manually emulate it with spawn_agent, and I under-emulated it.

This isn't a one-off, it consistently happens every time I run the skill. Is this a bug or a limitation of Codex? I'm also not sure if it's related to this #799 ?


Edit: here's some more comments from Codex:

Why results diverged:

Claude artifacts include raw reviewer output, not just final merged findings. Many items are testing_gaps or residual_risks, not primary findings. My final report only surfaced the main actionable findings and summarized coverage too tightly.

A lot of Claude’s raw findings violate the skill’s strict merge schema: severity values like high/medium/low, confidence values like 90/95/80. Under the skill’s Stage 5 validation, those are supposed to be dropped or normalized before final synthesis. I dropped some instead of salvaging/normalizing useful content.

I demoted or omitted several “coverage-only” items that Claude preserved, ....

Reviewer nondeterminism is real here. Different harness, model tiering, and agent implementations produced different issue discovery. But that does not fully excuse it: several Claude findings are valid and I should have caught or at least surfaced them as testing gaps/residual risks.

I also compressed the final report too much after applying the one safe fix. The skill asks for richer Coverage, Deployment Notes, and residual/actionable tables. My final answer was materially thinner than the Claude artifact set.

So the honest answer: I did spawn the required agents, but my synthesis pass was stricter and less complete than it should have been. The biggest actual miss is not agent coverage; it is that I failed to reconcile valid raw signals into the final report.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions