Commit ce383da

sjarmak and claude committed
feat: scaffold 3 tasks to close ≥1GB codebase coverage gaps
Add benchmark tasks for the 3 suites that lacked tasks on ≥1GB repos:

- k8s-rbac-auth-audit-001 (csb_sdlc_secure): Kubernetes RBAC authorization flow security audit using sg-evals/kubernetes--v1.32.0 (1.4GB)
- grafana-platform-orient-001 (csb_sdlc_understand): Grafana codebase orientation using sg-evals/grafana--v11.4.0 (1.4GB)
- CCX-crossorg-280 (csb_org_crossorg): Prometheus metrics exposition pattern across Kubernetes + Grafana (combined 2.9GB)

All 20 suites now have at least one task on a ≥1GB codebase.
Total benchmark: 373 tasks (151 SDLC + 221 Org + 1 new crossorg).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent db3f1ed commit ce383da

26 files changed: +3425 -0 lines changed

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+FROM ubuntu:22.04
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+# Base tools
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    git \
+    ca-certificates \
+    curl \
+    python3 \
+    && rm -rf /var/lib/apt/lists/*
+
+WORKDIR /workspace
+
+# Clone local checkout repos (baseline config: agent has local access to these)
+RUN git clone --depth 1 https://github.com/sg-evals/kubernetes--v1.32.0 /workspace/kubernetes--v1.32.0
+RUN git clone --depth 1 https://github.com/sg-evals/grafana--v11.4.0 /workspace/grafana--v11.4.0
+
+# Initialize git identity for agent commits
+RUN git config --global user.email "agent@example.com" && \
+    git config --global user.name "Agent" && \
+    git config --global safe.directory '*'
+
+# Create log directories
+RUN mkdir -p /logs/agent /logs/verifier
+
+# Pre-create claude user and set ownership at build time so Harbor's
+# runtime chown is a no-op (avoids 15-30 min delay on large repos).
+RUN (adduser --disabled-password --gecos '' claude 2>/dev/null || true) && \
+    for d in /workspace /app /testbed /logs; do [ -d "$d" ] && chown -R claude:claude "$d"; done || true
+
+ENTRYPOINT []
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+# Prometheus Metrics Exposition Pattern Across Kubernetes and Grafana
+
+## Your Task
+
+Find the Go source files in both kubernetes/kubernetes and grafana/grafana that implement the Prometheus metrics exposition pattern: where metrics are registered using the Prometheus client_golang library, how metric collectors are organized, and where the `/metrics` HTTP endpoint is configured to expose them. Compare how both projects structure their metrics infrastructure.
+
+## Context
+
+You are working on a codebase task involving repos from the crossorg domain. Both Kubernetes and Grafana use the Prometheus client_golang library to expose internal metrics, but they organize their metrics registration and exposition differently. Your goal is to map these patterns across both codebases.
+
+## Available Resources
+
+The local `/workspace/` directory contains: sg-evals/kubernetes--v1.32.0, sg-evals/grafana--v11.4.0.
+
+**Note:** Additional repositories are accessible via Sourcegraph MCP tools:
+- `sg-evals/kubernetes--v1.32.0` (kubernetes/kubernetes)
+- `sg-evals/grafana--v11.4.0` (grafana/grafana)
+
+## Output Format
+
+Create a file at `/workspace/answer.json` with your findings in the following structure:
+
+```json
+{
+  "files": [
+    {"repo": "org/repo-name", "path": "relative/path/to/file.go"}
+  ],
+  "symbols": [
+    {"repo": "org/repo-name", "path": "relative/path/to/file.go", "symbol": "SymbolName"}
+  ],
+  "chain": [],
+  "text": "Narrative explanation of your findings, citing repos and file paths."
+}
+```
+
+Include only the fields relevant to this task. Your answer is evaluated against a closed-world oracle — completeness matters.
+
+## Evaluation
+
+Your answer will be scored on:
+- **File recall and precision**: Did you find all relevant files?
+- **Symbol identification**: Correct symbol names and locations?
+- **Keyword presence**: Did you mention key concepts (metrics registry, collectors, exposition)?
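As a rough illustration of the output format described in the task file above, the following Python sketch assembles and writes an answer file. The helper names `build_answer` and `write_answer` are hypothetical and not part of the commit, the narrative text is a placeholder, and the demo writes to a temporary directory rather than `/workspace/answer.json`.

```python
import json
import tempfile
from pathlib import Path


def build_answer(files, symbols, text):
    """Assemble an answer dict with the files/symbols/text fields from the task."""
    return {
        "files": [{"repo": repo, "path": path} for repo, path in files],
        "symbols": [
            {"repo": repo, "path": path, "symbol": symbol}
            for repo, path, symbol in symbols
        ],
        "text": text,
    }


def write_answer(answer, path):
    """Serialize the answer as indented JSON at the given path."""
    Path(path).write_text(json.dumps(answer, indent=2) + "\n", encoding="utf-8")


# Hypothetical helper usage; the repo, path, and symbol are taken from the
# oracle answer in this commit, the text is a placeholder.
answer = build_answer(
    files=[("sg-evals/grafana--v11.4.0", "pkg/infra/metrics/metrics.go")],
    symbols=[
        ("sg-evals/grafana--v11.4.0", "pkg/infra/metrics/metrics.go", "MStatTotalDashboards")
    ],
    text="Placeholder narrative citing repos and file paths.",
)

# The task expects /workspace/answer.json; a temp directory keeps the demo portable.
out_path = Path(tempfile.mkdtemp()) / "answer.json"
write_answer(answer, out_path)
```

The `chain` field is left out here because the task says to include only the fields relevant to it.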
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+version = "1.0"
+
+[metadata]
+name = "CCX-crossorg-280"
+description = "Prometheus Metrics Exposition Pattern Across Kubernetes and Grafana"
+license = "Apache-2.0"
+
+[task]
+id = "CCX-crossorg-280"
+repo = "kubernetes/kubernetes"
+category = "cross-org-discovery"
+language = "go"
+difficulty = "hard"
+time_limit_sec = 900
+mcp_suite = "csb_org_crossorg"
+use_case_id = 280
+repo_set_id = "kubernetes-grafana-observability"
+mcp_unique = true
+verification_modes = ["artifact"]
+
+[verification]
+type = "test"
+command = "bash /tests/test.sh"
+
+reward_type = "score"
+description = "Prometheus Metrics Exposition Pattern Across Kubernetes and Grafana"
+
+[environment]
+build_timeout_sec = 600.0
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
+#!/bin/bash
+# eval.sh — MCP-unique benchmark evaluator for CCX-crossorg-280
+# Exit-code-first (SWE-Factory pattern):
+#   exit 0 — agent produced useful output (composite score > 0)
+#   exit 1 — total failure (composite score == 0 or missing answer)
+#
+# Writes /logs/verifier/reward.txt with the composite score [0.0, 1.0]
+
+set -euo pipefail
+
+TASK_ID="CCX-crossorg-280"
+ANSWER_PATH="/workspace/answer.json"
+TASK_SPEC_PATH="/tests/task_spec.json"
+ORACLE_CHECKS="/tests/oracle_checks.py"
+REWARD_PATH="/logs/verifier/reward.txt"
+
+mkdir -p /logs/verifier
+
+echo "=== CCX-crossorg-280 evaluator ==="
+echo "Task spec: $TASK_SPEC_PATH"
+echo "Answer: $ANSWER_PATH"
+echo ""
+
+# sg_only mode guard: restore full repo if verifier wrapper exists
+if [ -f /tmp/.sg_only_mode ] && [ -f /tests/sgonly_verifier_wrapper.sh ]; then
+    echo "sg_only mode: sourcing verifier wrapper..."
+    source /tests/sgonly_verifier_wrapper.sh
+fi
+
+# Verify answer file exists
+if [ ! -f "$ANSWER_PATH" ]; then
+    echo "ERROR: answer.json not found at $ANSWER_PATH"
+    echo "0.0" > "$REWARD_PATH"
+    exit 1
+fi
+
+# Validate answer is valid JSON
+if ! python3 -c "import json; json.load(open('$ANSWER_PATH'))" 2>/dev/null; then
+    echo "ERROR: answer.json is not valid JSON"
+    echo "0.0" > "$REWARD_PATH"
+    exit 1
+fi
+
+echo "answer.json found and valid JSON"
+
+# Run oracle checks
+if [ ! -f "$ORACLE_CHECKS" ]; then
+    echo "ERROR: oracle_checks.py not found at $ORACLE_CHECKS"
+    echo "0.0" > "$REWARD_PATH"
+    exit 1
+fi
+
+echo "Running oracle checks..."
+SCORE=$(python3 "$ORACLE_CHECKS" --answer "$ANSWER_PATH" --spec "$TASK_SPEC_PATH" --verbose 2>&1 | tee /dev/stderr | tail -1) || true
+
+# Validate score is a number
+if ! echo "$SCORE" | python3 -c "import sys; float(sys.stdin.read().strip())" 2>/dev/null; then
+    echo "ERROR: oracle_checks.py did not return a valid score: $SCORE"
+    echo "0.0" > "$REWARD_PATH"
+    exit 1
+fi
+
+echo ""
+echo "Composite score: $SCORE"
+echo "$SCORE" > "$REWARD_PATH"
+
+# Exit based on score (SWE-Factory exit-code-first pattern)
+python3 -c "import sys; sys.exit(0 if float('$SCORE') > 0 else 1)"
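The exit-code-first flow of the evaluator script above can be restated in Python as a rough sketch. The names `evaluate` and `score_fn` are hypothetical, and `score_fn` stands in for `oracle_checks.py`, whose contents are not shown in this part of the diff. The sketch returns the exit code instead of calling `sys.exit`, and it collapses the script's separate error messages into a single failure path.

```python
import json
import tempfile
from pathlib import Path


def evaluate(answer_path, reward_path, score_fn):
    """Write a score to reward_path and return 0 if it is positive, else 1.

    score_fn is a hypothetical stand-in for the oracle checks: it takes the
    parsed answer and returns a number.
    """
    answer_path, reward_path = Path(answer_path), Path(reward_path)
    reward_path.parent.mkdir(parents=True, exist_ok=True)
    try:
        # A missing file raises OSError; invalid JSON raises ValueError.
        answer = json.loads(answer_path.read_text(encoding="utf-8"))
        # A non-numeric score raises ValueError or TypeError.
        score = float(score_fn(answer))
    except (OSError, ValueError, TypeError):
        score = 0.0
    reward_path.write_text(f"{score}\n", encoding="utf-8")
    return 0 if score > 0 else 1


# A missing answer file yields exit code 1 and a reward of 0.0.
with tempfile.TemporaryDirectory() as tmp:
    missing_code = evaluate(
        Path(tmp) / "answer.json", Path(tmp) / "reward.txt", lambda answer: 1.0
    )
    missing_reward = (Path(tmp) / "reward.txt").read_text(encoding="utf-8").strip()
```

As in the shell script, every failure path still writes a reward before returning, so a downstream reader of the reward file always finds a value.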
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+{
+  "files": [
+    {"repo": "sg-evals/kubernetes--v1.32.0", "path": "staging/src/k8s.io/component-base/metrics/registry.go"},
+    {"repo": "sg-evals/kubernetes--v1.32.0", "path": "staging/src/k8s.io/component-base/metrics/counter.go"},
+    {"repo": "sg-evals/kubernetes--v1.32.0", "path": "staging/src/k8s.io/component-base/metrics/gauge.go"},
+    {"repo": "sg-evals/kubernetes--v1.32.0", "path": "staging/src/k8s.io/component-base/metrics/histogram.go"},
+    {"repo": "sg-evals/kubernetes--v1.32.0", "path": "staging/src/k8s.io/component-base/metrics/opts.go"},
+    {"repo": "sg-evals/kubernetes--v1.32.0", "path": "staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go"},
+    {"repo": "sg-evals/kubernetes--v1.32.0", "path": "pkg/kubelet/metrics/metrics.go"},
+    {"repo": "sg-evals/kubernetes--v1.32.0", "path": "staging/src/k8s.io/component-base/metrics/legacyregistry/registry.go"},
+    {"repo": "sg-evals/grafana--v11.4.0", "path": "pkg/infra/metrics/metrics.go"},
+    {"repo": "sg-evals/grafana--v11.4.0", "path": "pkg/api/metrics.go"},
+    {"repo": "sg-evals/grafana--v11.4.0", "path": "pkg/infra/metrics/service.go"},
+    {"repo": "sg-evals/grafana--v11.4.0", "path": "pkg/services/ngalert/metrics/metrics.go"}
+  ],
+  "symbols": [
+    {"repo": "sg-evals/kubernetes--v1.32.0", "path": "staging/src/k8s.io/component-base/metrics/registry.go", "symbol": "KubeRegistry"},
+    {"repo": "sg-evals/kubernetes--v1.32.0", "path": "staging/src/k8s.io/component-base/metrics/counter.go", "symbol": "Counter"},
+    {"repo": "sg-evals/kubernetes--v1.32.0", "path": "staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go", "symbol": "requestCounter"},
+    {"repo": "sg-evals/kubernetes--v1.32.0", "path": "staging/src/k8s.io/component-base/metrics/legacyregistry/registry.go", "symbol": "Register"},
+    {"repo": "sg-evals/grafana--v11.4.0", "path": "pkg/infra/metrics/metrics.go", "symbol": "MStatTotalDashboards"},
+    {"repo": "sg-evals/grafana--v11.4.0", "path": "pkg/api/metrics.go", "symbol": "MApiStatus"},
+    {"repo": "sg-evals/grafana--v11.4.0", "path": "pkg/infra/metrics/service.go", "symbol": "InternalMetricsService"}
+  ],
+  "text": "Both Kubernetes and Grafana use the Prometheus client_golang library for metrics exposition but with distinct architectural patterns. Kubernetes wraps the Prometheus client in staging/src/k8s.io/component-base/metrics/ with a custom KubeRegistry that adds stability-level annotations and deprecation tracking. Individual subsystems (apiserver, kubelet, scheduler) register their own metrics using this wrapper. Grafana uses a more centralized approach with pkg/infra/metrics/ containing global metric definitions registered at startup, plus per-service metrics (ngalert, api). Both expose metrics via /metrics HTTP endpoints using promhttp handlers.",
+  "_metadata": {
+    "oracle_type": "file_set_match",
+    "discovery_method": "manual_code_review",
+    "verified_at": "2026-03-03"
+  }
+}
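The oracle above is tagged `file_set_match`, and the task file lists file recall and precision among the scoring criteria. One plausible way to compute those figures is sketched below in Python. This is an assumption for illustration only: the actual `oracle_checks.py` is not shown in this part of the diff, the function names are hypothetical, and how the real evaluator weights files, symbols, and keywords into a composite score is unknown.

```python
def file_key(entry):
    """Identify a file entry by its (repo, path) pair."""
    return (entry["repo"], entry["path"])


def file_set_scores(answer_files, oracle_files):
    """Compute precision, recall, and F1 of answer files against oracle files."""
    got = {file_key(e) for e in answer_files}
    want = {file_key(e) for e in oracle_files}
    hits = len(got & want)
    precision = hits / len(got) if got else 0.0
    recall = hits / len(want) if want else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# The four Grafana files from the oracle answer in this commit.
grafana = "sg-evals/grafana--v11.4.0"
oracle_files = [
    {"repo": grafana, "path": "pkg/infra/metrics/metrics.go"},
    {"repo": grafana, "path": "pkg/api/metrics.go"},
    {"repo": grafana, "path": "pkg/infra/metrics/service.go"},
    {"repo": grafana, "path": "pkg/services/ngalert/metrics/metrics.go"},
]

# A hypothetical answer that finds two of the four and nothing else.
answer_files = oracle_files[:2]
scores = file_set_scores(answer_files, oracle_files)
# precision is 1.0 and recall is 0.5, so F1 is 2/3
```

Keying on the `(repo, path)` pair rather than the path alone matters in this task, because both repositories contain files with the same base name, such as `metrics.go`.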
