Add AI-powered build failure analysis agentic workflow by YuliiaKovalova · Pull Request #54401 · dotnet/sdk

YuliiaKovalova · 2026-05-21T16:53:18Z

Summary

Adds a gh-aw (GitHub Agentic Workflows) workflow that runs on every PR: when ./build.sh --binaryLog fails, an AI agent reads the binlog and inspects NuGet package metadata through two MCP servers --

binlog-mcp -- structured access to the MSBuild binary log (errors, warnings, target timeline, project graph).
NuGet.Mcp.Server -- package version / dependency / vulnerability lookup against nuget.org, used to disambiguate package-related failures (NU1605 downgrades, MSB3277 conflicts, runtime/aspire/templating pin drift across the SDK).

-- groups errors by root cause, and posts:

A single PR-level summary comment describing each failure cluster, the likely root cause, and a recommended fix.
Inline suggestion blocks tied to the diff for changes the reviewer can accept with one click.

The workflow is advisory only -- it never gates the merge, and is fork-PR-safe (forks: [] skips outside-contributor PRs).

What's in the PR

.github/agents/build-failure-analyst.agent.md -- repo-tailored agent prompt covering SDK-specific concerns: CLI host + sub-commands under src/Cli, generated manpages (documentation/manpages/sdk), the boundary with dotnet/templating for template-engine code, the local toolchain at ./.dotnet/, and the rule that .xlf files are auto-generated from .resx and must not be hand-edited.
.github/workflows/build-failure-analysis.md (+ generated .lock.yml) -- pull-request trigger on [main, release/**]; runs the build with continue-on-error, installs the two MCP servers (Microsoft.AITools.BinlogMcp + NuGet.Mcp.Server) as dotnet tools, dumps the binlog to JSON via the DumpBinlog helper, and delegates to the analyst agent. NuGet.Mcp.Server is registered with gh-aw as a long-running MCP service the agent can call during analysis. Timeout 60 min (SDK builds are heavier than typical).
.github/workflows/build-failure-analysis-command.md (+ .lock.yml) -- /analyze-build-failure slash command for re-running the analysis after force-pushes. Restricted to admin/maintainer/write roles.
.github/workflows/shared/build-failure-analysis-shared.md -- shared delegation body imported by both workflows.
.github/workflows/scripts/DumpBinlog/ -- standalone C# console app (net9.0, ModelContextProtocol 1.3.0) that speaks MCP stdio to binlog-mcp and writes overview.json/errors.json/warnings.json. Used as a pre-agent step because the gh-aw MCP gateway does not support non-containerized stdio MCP servers.
.github/aw/actions-lock.json -- minor bump of gh-aw-actions/setup (v0.71.5 -> v0.74.8), side effect of the newer gh aw CLI used to compile the new workflows.

How it runs

PR opened / updated -> workflow runs ./build.sh --binaryLog -c Release.
If a .binlog is produced (Arcade default location artifacts/log/<config>/Build.binlog), DumpBinlog extracts errors/warnings/overview to /tmp/binlog-data/*.json.
The agent receives the JSON + diff, and -- if the failure looks package-related -- queries NuGet.Mcp.Server for the actual published versions, dependency graphs, and known advisories.
The agent produces a fix plan grounded in both the binlog and live NuGet metadata.
A single summary comment is posted (older runs are auto-collapsed via hide-older-comments); up to 10 inline suggestion comments are added.

Safety properties

forks: [] -> agent never runs against an outside-contributor PR (no token leakage).
timeout_minutes: 60 -> hard cap.
permissions: contents: read, pull-requests: write -> minimum needed.
Network restricted to [defaults, dotnet].
Slash command is gated to [admin, maintainer, write].

Why this is useful for the SDK

The repo's most common build failures cluster into a few categories where an LLM with the binlog can do meaningful triage:

CLI command surface drift between src/Cli and tests.
Source generator output not flowing into CoreCompile (generated manpages, generated resource accessors).
MSB4019/NU1605 from package version drift across the runtime/aspire/templating pin points.
.xlf mismatches when a .resx was edited but localization wasn't synced.
TFM-conditioned failures (net472 vs net10.0) in shared helper projects.

Grouping these by root cause and posting an actionable suggestion saves a round trip for the contributor.

Coexistence with existing workflows

This sits alongside the existing agentic workflows in .github/workflows/ (add-tactics-template-on-comment, fix-completions-on-comment, detect-netsdk-diagnostics, etc.). It uses the same .github/aw/actions-lock.json registry and the same network allowlist conventions.

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

Adds a gh-aw (GitHub Agentic Workflows) workflow that runs on every PR: when ./build.sh --binaryLog fails, an AI agent reads the binlog (via the binlog-mcp dotnet global tool), groups errors by root cause, and posts a single PR comment plus inline ```suggestion blocks tied to the diff. New files: * .github/agents/build-failure-analyst.agent.md Repo-tailored agent prompt covering SDK-specific concerns (CLI host + sub-commands under src/Cli, generated manpages, template-engine boundary in dotnet/templating, local toolchain at ./.dotnet/, no hand-edits to .xlf files). * .github/workflows/build-failure-analysis.md (+ .lock.yml) Pull-request trigger; runs ./build.sh --binaryLog with continue-on- error, installs AITools.BinlogMcp + NuGet.Mcp.Server, dumps the binlog to /tmp/binlog-data/*.json via the DumpBinlog helper, and delegates to the analyst agent. Advisory only -- does not gate. Timeout bumped to 60 min because SDK builds are heavier. * .github/workflows/build-failure-analysis-command.md (+ .lock.yml) /analyze-build-failure slash command for re-running the analysis after force-pushes or comment dismissals. * .github/workflows/shared/build-failure-analysis-shared.md Shared delegation body imported by both workflows; launches the agent as a background task and noops immediately. * .github/workflows/scripts/DumpBinlog/ Standalone C# console app (net9.0, ModelContextProtocol 1.3.0) that speaks MCP stdio to binlog-mcp and writes overview/errors/warnings JSON. Used as a pre-agent step because the gh-aw MCP gateway does not support non-containerized stdio MCP servers. Modified: * .github/aw/actions-lock.json Minor bump of gh-aw-actions/setup (v0.71.5 -> v0.74.8) -- side effect of the newer gh-aw CLI compiling the new workflows. The workflow is fork-PR-safe (forks: [] skips them) and restricted to admin/maintainer/write roles for manual reruns. Inline suggestion comments are capped at 10 per run; the summary comment uses hide-older-comments to collapse prior runs on update. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Adds a GitHub Agentic Workflows (gh-aw) automation that reruns ./build.sh --binaryLog on PR updates (and via a maintainer slash-command), extracts structured error data from the produced .binlog, and delegates to a repo-tailored analyst agent to post a single summary comment plus up to 10 inline suggestion comments.

Changes:

Introduces PR-triggered and slash-command-triggered gh-aw workflows that run the SDK build, dump binlog data to JSON, and invoke an analysis agent.
Adds a standalone DumpBinlog C# console app that speaks MCP over stdio to binlog-mcp and writes overview/errors/warnings JSON files.
Adds a build-failure-analyst agent prompt customized for dotnet/sdk conventions and constraints.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
.github/workflows/shared/build-failure-analysis-shared.md	Shared gh-aw prompt body that delegates to the build failure analyst agent.
.github/workflows/scripts/DumpBinlog/Program.cs	MCP stdio client that calls `binlog_*` tools and writes JSON outputs for the agent to consume.
.github/workflows/scripts/DumpBinlog/DumpBinlog.csproj	Project file for the standalone DumpBinlog console tool (package + TFM).
.github/workflows/build-failure-analysis.md	PR-triggered workflow: runs build, installs tools, dumps binlog JSON, exports agent context, and delegates.
.github/workflows/build-failure-analysis.lock.yml	Generated compiled workflow lockfile for the PR-triggered workflow.
.github/workflows/build-failure-analysis-command.md	Slash-command workflow: reruns build + binlog dump and delegates analysis.
.github/workflows/build-failure-analysis-command.lock.yml	Generated compiled workflow lockfile for the slash-command workflow.
.github/aw/actions-lock.json	Updates gh-aw setup action pin (v0.71.5 → v0.74.8).
.github/agents/build-failure-analyst.agent.md	Repo-specific agent instructions for clustering failures and posting summary + inline suggestions.

Comments suppressed due to low confidence (1)

.github/agents/build-failure-analyst.agent.md:183

Same issue here: ${{ github.server_url }}/${{ github.repository }}/${{ github.run_id }} is Actions expression syntax and won’t resolve when the agent renders the summary comment. Recommend switching these templates to use environment variables (or have the calling workflow pass the fully-formed run URL/repo URL into GH_AW_* variables).

<sub>🤖 Generated by the [Build Failure Analysis workflow](${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}) using <a href="https://dev.azure.com/dnceng/public/_artifacts/feed/dotnet-tools/NuGet/AITools.BinlogMcp">binlog-mcp</a> · commit ${GH_AW_PR_HEAD_SHA}</sub>

Build links to source using ${{ github.server_url }}/${{ github.repository }}/blob/${GH_AW_PR_HEAD_SHA}/<relative-path>#L<line>.

</details>

* Agent prompt: fix unclosed triple-backtick fence in frontmatter description -- use inline `suggestion` blocks instead, so the prompt doesn't start a stray markdown code fence. * Agent prompt: replace GitHub Actions expression syntax (github.* templates) with runtime env vars (GITHUB_SERVER_URL, GITHUB_REPOSITORY, GITHUB_RUN_ID) -- Actions expressions are not interpolated inside an agent prompt and would have been posted literally in the summary comment. * Agent prompt: fix tool-name mismatch -- the example said nuget_fix_vulnerable_packages but the documented tool is fix_vulnerable_packages. Aligned the example with the actual tool name. * PR-trigger workflow + shared body: same fence-closing fix in the description / shared prompt body so importers don't start a stray markdown code fence. * Command (slash-command) workflow: add NUGET_MCP_VERSION env, an "Install NuGet MCP Server" step, and NuGet.Mcp.Server in the bash tool allowlist -- previously the command workflow couldn't run the NuGet remediation path the analyst prompt asks for, so /analyze-build-failure reruns were silently degraded vs the PR-trigger flow. * Both workflows: resolve the binlog path to an absolute path in find-binlog (via realpath) so GH_AW_BINLOG_PATH is absolute as the agent prompt expects. Drop the now-redundant GITHUB_WORKSPACE prefix when invoking DumpBinlog. * DumpBinlog: in the fatal outer catch, write placeholder binlog-*.json files with an { "error": ... } payload so downstream `cat` steps and the agent always have something structured to read, even when the MCP client cannot be constructed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Addresses Dimension 17 (File I/O & Path Handling) NIT from the Expert MSBuild Review. Previously, an empty `args[0]` would flow through `Path.GetFullPath("")` (which resolves to the current working directory), then through `File.Exists(<cwd>)` (false because cwd is a directory), producing a confusing "Binlog not found: /path/to/cwd" error. Add an explicit `string.IsNullOrWhiteSpace` check up front so the failure mode is obvious. Applied to both repos symmetrically. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…vers The previous CI failure at "Start MCP Gateway" (run 26238445223, job 77219093390) was caused by gh-aw emitting malformed JSON for a `mcp-servers.nuget: { command: ..., args: [...] }` (process / stdio MCP server) declaration: the generated config contained `"nuget": {` with no body and no closing brace, immediately followed by `"safeoutputs":`, which crashed the gateway with "Expected ',' or '}' after property value in JSON at position 1001 (line 42 column 1)". This change rewires NuGet MCP Server the same way `dotnet/sdk` and `dotnet/msbuild` use it: * `NuGet.Mcp.Server` is installed as a dotnet global tool in a pre-agent step (already present here). * It is listed in the workflow's `bash:` allowlist so the agent can invoke `NuGet.Mcp.Server` directly via the shell tool. * The MCP gateway never sees a `nuget` server entry, so the malformed-JSON path that crashes the gateway is bypassed entirely. Verified: the regenerated `*.lock.yml` MCP config now contains only the `github` and `safeoutputs` server entries and parses as valid JSON. Also applies the same Copilot-review fixes that landed in `dotnet/sdk#54401` and `dotnet/msbuild#13835`: * Triple-backtick fence inside YAML `description:` blocks rewritten as inline backtick spans so the description is not interpreted as an unterminated code fence. * `${{ github.* }}` expressions inside the agent prompt body rewritten as runtime env vars (`${GITHUB_SERVER_URL}`, `${GITHUB_REPOSITORY}`, `${GITHUB_RUN_ID}`). The agent prompt body is passed verbatim to the LLM and is not YAML, so MSBuild-style expressions stay literal. * Agent description corrected to point at the C# `DumpBinlog` helper instead of the removed `dump-binlog.js`. * `Locate binlog` step now resolves the binlog path to absolute via `realpath` so downstream consumers can treat `GH_AW_BINLOG_PATH` as absolute, and the `Dump binlog as JSON` step drops the redundant `$GITHUB_WORKSPACE/` prefix. * Bumped the `dotnet run` timeout for `DumpBinlog` from 120s to 180s to match the sdk/msbuild workflows (large binlogs in those repos needed the extra headroom). * `DumpBinlog/Program.cs` rejects an empty-string binlog argument before calling `Path.GetFullPath`, and if `McpClient.CreateAsync` throws it now writes a stub `{ "error": "DumpBinlog fatal: …" }` payload to each of the three output JSON files so the agent's `cat` step always has structured input to read. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The same Copilot-review feedback that was addressed on the equivalent PRs in `dotnet/sdk` (#54401) and `dotnet/msbuild` (#13835) also applies to the VMR version of this workflow. Bringing those three fixes here: * `build-failure-analyst.agent.md` — fix stale reference to the removed `dump-binlog.js` helper. The C# `DumpBinlog` program has replaced it; update the agent description so the agent looks for the right artifact. * `build-failure-analyst.agent.md` — replace `${{ github.* }}` expressions in the agent prompt body with runtime env vars (`${GITHUB_SERVER_URL}`, `${GITHUB_REPOSITORY}`, `${GITHUB_RUN_ID}`, `${GH_AW_PR_HEAD_SHA}`). The agent prompt body is passed verbatim to the LLM and is *not* YAML, so `${{ ... }}` expressions stay literal in the rendered comment instead of being substituted at runtime. * `DumpBinlog/Program.cs` — reject an empty-string binlog argument explicitly before calling `Path.GetFullPath` (a NIT from the msbuild Expert MSBuild Review), and, if `McpClient.CreateAsync` throws (e.g., `binlog-mcp` is not on PATH), write a stub `{ "error": "DumpBinlog fatal: …" }` payload to each of the three expected output JSON files so the agent's `cat` step always has structured input to read. Without this, a fatal MCP-client failure silently produced no files at all and the agent could not even report the failure. The lock file (`build-failure-analysis.lock.yml`) is regenerated by `gh aw compile --strict`. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…vers The previous CI failure at "Start MCP Gateway" (run 26238445223, job 77219093390) was caused by gh-aw emitting malformed JSON for a `mcp-servers.nuget: { command: ..., args: [...] }` (process / stdio MCP server) declaration: the generated config contained `"nuget": {` with no body and no closing brace, immediately followed by `"safeoutputs":`, which crashed the gateway with "Expected ',' or '}' after property value in JSON at position 1001 (line 42 column 1)". This change rewires NuGet MCP Server the same way `dotnet/sdk` and `dotnet/msbuild` use it: * `NuGet.Mcp.Server` is installed as a dotnet global tool in a pre-agent step (already present here). * It is listed in the workflow's `bash:` allowlist so the agent can invoke `NuGet.Mcp.Server` directly via the shell tool. * The MCP gateway never sees a `nuget` server entry, so the malformed-JSON path that crashes the gateway is bypassed entirely. Verified: the regenerated `*.lock.yml` MCP config now contains only the `github` and `safeoutputs` server entries and parses as valid JSON. Also applies the same Copilot-review fixes that landed in `dotnet/sdk#54401` and `dotnet/msbuild#13835`: * Triple-backtick fence inside YAML `description:` blocks rewritten as inline backtick spans so the description is not interpreted as an unterminated code fence. * `${{ github.* }}` expressions inside the agent prompt body rewritten as runtime env vars (`${GITHUB_SERVER_URL}`, `${GITHUB_REPOSITORY}`, `${GITHUB_RUN_ID}`). The agent prompt body is passed verbatim to the LLM and is not YAML, so MSBuild-style expressions stay literal. * Agent description corrected to point at the C# `DumpBinlog` helper instead of the removed `dump-binlog.js`. * `Locate binlog` step now resolves the binlog path to absolute via `realpath` so downstream consumers can treat `GH_AW_BINLOG_PATH` as absolute, and the `Dump binlog as JSON` step drops the redundant `$GITHUB_WORKSPACE/` prefix. * Bumped the `dotnet run` timeout for `DumpBinlog` from 120s to 180s to match the sdk/msbuild workflows (large binlogs in those repos needed the extra headroom). * `DumpBinlog/Program.cs` rejects an empty-string binlog argument before calling `Path.GetFullPath`, and if `McpClient.CreateAsync` throws it now writes a stub `{ "error": "DumpBinlog fatal: …" }` payload to each of the three output JSON files so the agent's `cat` step always has structured input to read. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…otnet - Remove NuGet.Mcp.Server from bash tools list (causes empty 'nuget': {} block in MCP Gateway JSON, crashing the gateway) - Add 'dotnet' to bash allowlist so agent can invoke global tools - Update agent to use 'dotnet NuGet.Mcp.Server' instead of stdin JSON - Recompile both lock files with gh aw compile --strict Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

YuliiaKovalova · 2026-05-21T18:35:01Z

the output example: YuliiaKovalova/dotnet#3 (comment)

…ogMcp The dotnet-tools feed package that backs the `binlog-mcp` CLI has been renamed from `AITools.BinlogMcp` to `Microsoft.AITools.BinlogMcp` (now under the canonical `Microsoft.*` namespace). The CLI command exposed by the tool (`binlog-mcp`) is unchanged, so only the `dotnet tool install --global …` line and the package version need to move. * Bump `BINLOG_MCP_VERSION` from `1.0.0-preview.26268.3` to `1.0.0-preview.26272.1` (the first version published under the new package id). * Update `dotnet tool install --global` invocations in both `build-failure-analysis.md` and `build-failure-analysis-command.md`. * Update the AzDO feed permalink in the agent's comment-footer template to point at the new package. * Regenerate `.lock.yml` files via `gh aw compile --strict`. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings May 21, 2026 16:53

Copilot started reviewing on behalf of YuliiaKovalova May 21, 2026 16:54 View session

Copilot AI reviewed May 21, 2026

View reviewed changes

YuliiaKovalova and others added 2 commits May 21, 2026 19:07

YuliiaKovalova requested a review from baronfel May 21, 2026 18:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add AI-powered build failure analysis agentic workflow#54401

Add AI-powered build failure analysis agentic workflow#54401
YuliiaKovalova wants to merge 5 commits into
dotnet:mainfrom
YuliiaKovalova:feat/onboard-build-failure-analysis

YuliiaKovalova commented May 21, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

YuliiaKovalova commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

YuliiaKovalova commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's in the PR

How it runs

Safety properties

Why this is useful for the SDK

Coexistence with existing workflows

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

YuliiaKovalova commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

YuliiaKovalova commented May 21, 2026 •

edited

Loading