Skip to content

Add AI-powered build failure analysis agentic workflow#54401

Open
YuliiaKovalova wants to merge 5 commits into
dotnet:mainfrom
YuliiaKovalova:feat/onboard-build-failure-analysis
Open

Add AI-powered build failure analysis agentic workflow#54401
YuliiaKovalova wants to merge 5 commits into
dotnet:mainfrom
YuliiaKovalova:feat/onboard-build-failure-analysis

Conversation

@YuliiaKovalova
Copy link
Copy Markdown
Member

@YuliiaKovalova YuliiaKovalova commented May 21, 2026

Summary

Adds a gh-aw (GitHub Agentic Workflows) workflow that runs on every PR: when ./build.sh --binaryLog fails, an AI agent reads the binlog and inspects NuGet package metadata through two MCP servers --

  • binlog-mcp -- structured access to the MSBuild binary log (errors, warnings, target timeline, project graph).
  • NuGet.Mcp.Server -- package version / dependency / vulnerability lookup against nuget.org, used to disambiguate package-related failures (NU1605 downgrades, MSB3277 conflicts, runtime/aspire/templating pin drift across the SDK).

-- groups errors by root cause, and posts:

  1. A single PR-level summary comment describing each failure cluster, the likely root cause, and a recommended fix.
  2. Inline suggestion blocks tied to the diff for changes the reviewer can accept with one click.

The workflow is advisory only -- it never gates the merge, and is fork-PR-safe (forks: [] skips outside-contributor PRs).

What's in the PR

  • .github/agents/build-failure-analyst.agent.md -- repo-tailored agent prompt covering SDK-specific concerns: CLI host + sub-commands under src/Cli, generated manpages (documentation/manpages/sdk), the boundary with dotnet/templating for template-engine code, the local toolchain at ./.dotnet/, and the rule that .xlf files are auto-generated from .resx and must not be hand-edited.
  • .github/workflows/build-failure-analysis.md (+ generated .lock.yml) -- pull-request trigger on [main, release/**]; runs the build with continue-on-error, installs the two MCP servers (Microsoft.AITools.BinlogMcp + NuGet.Mcp.Server) as dotnet tools, dumps the binlog to JSON via the DumpBinlog helper, and delegates to the analyst agent. NuGet.Mcp.Server is registered with gh-aw as a long-running MCP service the agent can call during analysis. Timeout 60 min (SDK builds are heavier than typical).
  • .github/workflows/build-failure-analysis-command.md (+ .lock.yml) -- /analyze-build-failure slash command for re-running the analysis after force-pushes. Restricted to admin/maintainer/write roles.
  • .github/workflows/shared/build-failure-analysis-shared.md -- shared delegation body imported by both workflows.
  • .github/workflows/scripts/DumpBinlog/ -- standalone C# console app (net9.0, ModelContextProtocol 1.3.0) that speaks MCP stdio to binlog-mcp and writes overview.json/errors.json/warnings.json. Used as a pre-agent step because the gh-aw MCP gateway does not support non-containerized stdio MCP servers.
  • .github/aw/actions-lock.json -- minor bump of gh-aw-actions/setup (v0.71.5 -> v0.74.8), side effect of the newer gh aw CLI used to compile the new workflows.

How it runs

  1. PR opened / updated -> workflow runs ./build.sh --binaryLog -c Release.
  2. If a .binlog is produced (Arcade default location artifacts/log/<config>/Build.binlog), DumpBinlog extracts errors/warnings/overview to /tmp/binlog-data/*.json.
  3. The agent receives the JSON + diff, and -- if the failure looks package-related -- queries NuGet.Mcp.Server for the actual published versions, dependency graphs, and known advisories.
  4. The agent produces a fix plan grounded in both the binlog and live NuGet metadata.
  5. A single summary comment is posted (older runs are auto-collapsed via hide-older-comments); up to 10 inline suggestion comments are added.

Safety properties

  • forks: [] -> agent never runs against an outside-contributor PR (no token leakage).
  • timeout_minutes: 60 -> hard cap.
  • permissions: contents: read, pull-requests: write -> minimum needed.
  • Network restricted to [defaults, dotnet].
  • Slash command is gated to [admin, maintainer, write].

Why this is useful for the SDK

The repo's most common build failures cluster into a few categories where an LLM with the binlog can do meaningful triage:

  • CLI command surface drift between src/Cli and tests.
  • Source generator output not flowing into CoreCompile (generated manpages, generated resource accessors).
  • MSB4019/NU1605 from package version drift across the runtime/aspire/templating pin points.
  • .xlf mismatches when a .resx was edited but localization wasn't synced.
  • TFM-conditioned failures (net472 vs net10.0) in shared helper projects.

Grouping these by root cause and posting an actionable suggestion saves a round trip for the contributor.

Coexistence with existing workflows

This sits alongside the existing agentic workflows in .github/workflows/ (add-tactics-template-on-comment, fix-completions-on-comment, detect-netsdk-diagnostics, etc.). It uses the same .github/aw/actions-lock.json registry and the same network allowlist conventions.


Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

Adds a gh-aw (GitHub Agentic Workflows) workflow that runs on every PR:
when ./build.sh --binaryLog fails, an AI agent reads the binlog (via the
binlog-mcp dotnet global tool), groups errors by root cause, and posts a
single PR comment plus inline ```suggestion blocks tied to the diff.

New files:

* .github/agents/build-failure-analyst.agent.md
    Repo-tailored agent prompt covering SDK-specific concerns (CLI host
    + sub-commands under src/Cli, generated manpages, template-engine
    boundary in dotnet/templating, local toolchain at ./.dotnet/, no
    hand-edits to .xlf files).

* .github/workflows/build-failure-analysis.md (+ .lock.yml)
    Pull-request trigger; runs ./build.sh --binaryLog with continue-on-
    error, installs AITools.BinlogMcp + NuGet.Mcp.Server, dumps the
    binlog to /tmp/binlog-data/*.json via the DumpBinlog helper, and
    delegates to the analyst agent. Advisory only -- does not gate.
    Timeout bumped to 60 min because SDK builds are heavier.

* .github/workflows/build-failure-analysis-command.md (+ .lock.yml)
    /analyze-build-failure slash command for re-running the analysis
    after force-pushes or comment dismissals.

* .github/workflows/shared/build-failure-analysis-shared.md
    Shared delegation body imported by both workflows; launches the
    agent as a background task and noops immediately.

* .github/workflows/scripts/DumpBinlog/
    Standalone C# console app (net9.0, ModelContextProtocol 1.3.0) that
    speaks MCP stdio to binlog-mcp and writes overview/errors/warnings
    JSON. Used as a pre-agent step because the gh-aw MCP gateway does
    not support non-containerized stdio MCP servers.

Modified:

* .github/aw/actions-lock.json
    Minor bump of gh-aw-actions/setup (v0.71.5 -> v0.74.8) -- side
    effect of the newer gh-aw CLI compiling the new workflows.

The workflow is fork-PR-safe (forks: [] skips them) and restricted to
admin/maintainer/write roles for manual reruns. Inline suggestion comments
are capped at 10 per run; the summary comment uses hide-older-comments to
collapse prior runs on update.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 21, 2026 16:53
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a GitHub Agentic Workflows (gh-aw) automation that reruns ./build.sh --binaryLog on PR updates (and via a maintainer slash-command), extracts structured error data from the produced .binlog, and delegates to a repo-tailored analyst agent to post a single summary comment plus up to 10 inline suggestion comments.

Changes:

  • Introduces PR-triggered and slash-command-triggered gh-aw workflows that run the SDK build, dump binlog data to JSON, and invoke an analysis agent.
  • Adds a standalone DumpBinlog C# console app that speaks MCP over stdio to binlog-mcp and writes overview/errors/warnings JSON files.
  • Adds a build-failure-analyst agent prompt customized for dotnet/sdk conventions and constraints.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
.github/workflows/shared/build-failure-analysis-shared.md Shared gh-aw prompt body that delegates to the build failure analyst agent.
.github/workflows/scripts/DumpBinlog/Program.cs MCP stdio client that calls binlog_* tools and writes JSON outputs for the agent to consume.
.github/workflows/scripts/DumpBinlog/DumpBinlog.csproj Project file for the standalone DumpBinlog console tool (package + TFM).
.github/workflows/build-failure-analysis.md PR-triggered workflow: runs build, installs tools, dumps binlog JSON, exports agent context, and delegates.
.github/workflows/build-failure-analysis.lock.yml Generated compiled workflow lockfile for the PR-triggered workflow.
.github/workflows/build-failure-analysis-command.md Slash-command workflow: reruns build + binlog dump and delegates analysis.
.github/workflows/build-failure-analysis-command.lock.yml Generated compiled workflow lockfile for the slash-command workflow.
.github/aw/actions-lock.json Updates gh-aw setup action pin (v0.71.5 → v0.74.8).
.github/agents/build-failure-analyst.agent.md Repo-specific agent instructions for clustering failures and posting summary + inline suggestions.
Comments suppressed due to low confidence (1)

.github/agents/build-failure-analyst.agent.md:183

  • Same issue here: ${{ github.server_url }}/${{ github.repository }}/${{ github.run_id }} is Actions expression syntax and won’t resolve when the agent renders the summary comment. Recommend switching these templates to use environment variables (or have the calling workflow pass the fully-formed run URL/repo URL into GH_AW_* variables).
<sub>🤖 Generated by the [Build Failure Analysis workflow](${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}) using <a href="https://dev.azure.com/dnceng/public/_artifacts/feed/dotnet-tools/NuGet/AITools.BinlogMcp">binlog-mcp</a> · commit ${GH_AW_PR_HEAD_SHA}</sub>

Build links to source using ${{ github.server_url }}/${{ github.repository }}/blob/${GH_AW_PR_HEAD_SHA}/<relative-path>#L<line>.

</details>

Comment thread .github/workflows/build-failure-analysis-command.md
Comment thread .github/agents/build-failure-analyst.agent.md Outdated
Comment thread .github/agents/build-failure-analyst.agent.md Outdated
Comment thread .github/workflows/build-failure-analysis.md
Comment thread .github/workflows/build-failure-analysis-command.md
Comment thread .github/workflows/scripts/DumpBinlog/Program.cs
YuliiaKovalova and others added 2 commits May 21, 2026 19:07
* Agent prompt: fix unclosed triple-backtick fence in frontmatter description
  -- use inline `suggestion` blocks instead, so the prompt doesn't start a
  stray markdown code fence.
* Agent prompt: replace GitHub Actions expression syntax (github.* templates)
  with runtime env vars (GITHUB_SERVER_URL, GITHUB_REPOSITORY, GITHUB_RUN_ID)
  -- Actions expressions are not interpolated inside an agent prompt and
  would have been posted literally in the summary comment.
* Agent prompt: fix tool-name mismatch -- the example said
  nuget_fix_vulnerable_packages but the documented tool is
  fix_vulnerable_packages. Aligned the example with the actual tool name.
* PR-trigger workflow + shared body: same fence-closing fix in the
  description / shared prompt body so importers don't start a stray
  markdown code fence.
* Command (slash-command) workflow: add NUGET_MCP_VERSION env, an
  "Install NuGet MCP Server" step, and NuGet.Mcp.Server in the bash tool
  allowlist -- previously the command workflow couldn't run the NuGet
  remediation path the analyst prompt asks for, so /analyze-build-failure
  reruns were silently degraded vs the PR-trigger flow.
* Both workflows: resolve the binlog path to an absolute path in
  find-binlog (via realpath) so GH_AW_BINLOG_PATH is absolute as the
  agent prompt expects. Drop the now-redundant GITHUB_WORKSPACE prefix
  when invoking DumpBinlog.
* DumpBinlog: in the fatal outer catch, write placeholder binlog-*.json
  files with an { "error": ... } payload so downstream `cat` steps and
  the agent always have something structured to read, even when the MCP
  client cannot be constructed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses Dimension 17 (File I/O & Path Handling) NIT from the Expert
MSBuild Review. Previously, an empty `args[0]` would flow through
`Path.GetFullPath("")` (which resolves to the current working directory),
then through `File.Exists(<cwd>)` (false because cwd is a directory),
producing a confusing "Binlog not found: /path/to/cwd" error.

Add an explicit `string.IsNullOrWhiteSpace` check up front so the failure
mode is obvious. Applied to both repos symmetrically.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
YuliiaKovalova added a commit to microsoft/testfx that referenced this pull request May 21, 2026
…vers

The previous CI failure at "Start MCP Gateway" (run 26238445223,
job 77219093390) was caused by gh-aw emitting malformed JSON for a
`mcp-servers.nuget: { command: ..., args: [...] }` (process / stdio MCP
server) declaration: the generated config contained `"nuget": {` with
no body and no closing brace, immediately followed by `"safeoutputs":`,
which crashed the gateway with "Expected ',' or '}' after property
value in JSON at position 1001 (line 42 column 1)".

This change rewires NuGet MCP Server the same way `dotnet/sdk` and
`dotnet/msbuild` use it:

  * `NuGet.Mcp.Server` is installed as a dotnet global tool in a
    pre-agent step (already present here).
  * It is listed in the workflow's `bash:` allowlist so the agent can
    invoke `NuGet.Mcp.Server` directly via the shell tool.
  * The MCP gateway never sees a `nuget` server entry, so the
    malformed-JSON path that crashes the gateway is bypassed entirely.

Verified: the regenerated `*.lock.yml` MCP config now contains only the
`github` and `safeoutputs` server entries and parses as valid JSON.

Also applies the same Copilot-review fixes that landed in
`dotnet/sdk#54401` and `dotnet/msbuild#13835`:

  * Triple-backtick fence inside YAML `description:` blocks rewritten
    as inline backtick spans so the description is not interpreted as
    an unterminated code fence.
  * `${{ github.* }}` expressions inside the agent prompt body
    rewritten as runtime env vars (`${GITHUB_SERVER_URL}`,
    `${GITHUB_REPOSITORY}`, `${GITHUB_RUN_ID}`). The agent prompt body
    is passed verbatim to the LLM and is not YAML, so MSBuild-style
    expressions stay literal.
  * Agent description corrected to point at the C# `DumpBinlog`
    helper instead of the removed `dump-binlog.js`.
  * `Locate binlog` step now resolves the binlog path to absolute via
    `realpath` so downstream consumers can treat `GH_AW_BINLOG_PATH`
    as absolute, and the `Dump binlog as JSON` step drops the redundant
    `$GITHUB_WORKSPACE/` prefix.
  * Bumped the `dotnet run` timeout for `DumpBinlog` from 120s to 180s
    to match the sdk/msbuild workflows (large binlogs in those repos
    needed the extra headroom).
  * `DumpBinlog/Program.cs` rejects an empty-string binlog argument
    before calling `Path.GetFullPath`, and if `McpClient.CreateAsync`
    throws it now writes a stub `{ "error": "DumpBinlog fatal: …" }`
    payload to each of the three output JSON files so the agent's
    `cat` step always has structured input to read.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
YuliiaKovalova added a commit to YuliiaKovalova/dotnet that referenced this pull request May 21, 2026
The same Copilot-review feedback that was addressed on the equivalent
PRs in `dotnet/sdk` (#54401) and `dotnet/msbuild` (#13835) also applies
to the VMR version of this workflow. Bringing those three fixes here:

  * `build-failure-analyst.agent.md` — fix stale reference to the
    removed `dump-binlog.js` helper. The C# `DumpBinlog` program has
    replaced it; update the agent description so the agent looks for
    the right artifact.

  * `build-failure-analyst.agent.md` — replace `${{ github.* }}`
    expressions in the agent prompt body with runtime env vars
    (`${GITHUB_SERVER_URL}`, `${GITHUB_REPOSITORY}`, `${GITHUB_RUN_ID}`,
    `${GH_AW_PR_HEAD_SHA}`). The agent prompt body is passed verbatim
    to the LLM and is *not* YAML, so `${{ ... }}` expressions stay
    literal in the rendered comment instead of being substituted at
    runtime.

  * `DumpBinlog/Program.cs` — reject an empty-string binlog argument
    explicitly before calling `Path.GetFullPath` (a NIT from the
    msbuild Expert MSBuild Review), and, if `McpClient.CreateAsync`
    throws (e.g., `binlog-mcp` is not on PATH), write a stub
    `{ "error": "DumpBinlog fatal: …" }` payload to each of the three
    expected output JSON files so the agent's `cat` step always has
    structured input to read. Without this, a fatal MCP-client failure
    silently produced no files at all and the agent could not even
    report the failure.

The lock file (`build-failure-analysis.lock.yml`) is regenerated by
`gh aw compile --strict`.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
YuliiaKovalova added a commit to microsoft/testfx that referenced this pull request May 21, 2026
…vers

The previous CI failure at "Start MCP Gateway" (run 26238445223,
job 77219093390) was caused by gh-aw emitting malformed JSON for a
`mcp-servers.nuget: { command: ..., args: [...] }` (process / stdio MCP
server) declaration: the generated config contained `"nuget": {` with
no body and no closing brace, immediately followed by `"safeoutputs":`,
which crashed the gateway with "Expected ',' or '}' after property
value in JSON at position 1001 (line 42 column 1)".

This change rewires NuGet MCP Server the same way `dotnet/sdk` and
`dotnet/msbuild` use it:

  * `NuGet.Mcp.Server` is installed as a dotnet global tool in a
    pre-agent step (already present here).
  * It is listed in the workflow's `bash:` allowlist so the agent can
    invoke `NuGet.Mcp.Server` directly via the shell tool.
  * The MCP gateway never sees a `nuget` server entry, so the
    malformed-JSON path that crashes the gateway is bypassed entirely.

Verified: the regenerated `*.lock.yml` MCP config now contains only the
`github` and `safeoutputs` server entries and parses as valid JSON.

Also applies the same Copilot-review fixes that landed in
`dotnet/sdk#54401` and `dotnet/msbuild#13835`:

  * Triple-backtick fence inside YAML `description:` blocks rewritten
    as inline backtick spans so the description is not interpreted as
    an unterminated code fence.
  * `${{ github.* }}` expressions inside the agent prompt body
    rewritten as runtime env vars (`${GITHUB_SERVER_URL}`,
    `${GITHUB_REPOSITORY}`, `${GITHUB_RUN_ID}`). The agent prompt body
    is passed verbatim to the LLM and is not YAML, so MSBuild-style
    expressions stay literal.
  * Agent description corrected to point at the C# `DumpBinlog`
    helper instead of the removed `dump-binlog.js`.
  * `Locate binlog` step now resolves the binlog path to absolute via
    `realpath` so downstream consumers can treat `GH_AW_BINLOG_PATH`
    as absolute, and the `Dump binlog as JSON` step drops the redundant
    `$GITHUB_WORKSPACE/` prefix.
  * Bumped the `dotnet run` timeout for `DumpBinlog` from 120s to 180s
    to match the sdk/msbuild workflows (large binlogs in those repos
    needed the extra headroom).
  * `DumpBinlog/Program.cs` rejects an empty-string binlog argument
    before calling `Path.GetFullPath`, and if `McpClient.CreateAsync`
    throws it now writes a stub `{ "error": "DumpBinlog fatal: …" }`
    payload to each of the three output JSON files so the agent's
    `cat` step always has structured input to read.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…otnet

- Remove NuGet.Mcp.Server from bash tools list (causes empty
  'nuget': {} block in MCP Gateway JSON, crashing the gateway)
- Add 'dotnet' to bash allowlist so agent can invoke global tools
- Update agent to use 'dotnet NuGet.Mcp.Server' instead of stdin JSON
- Recompile both lock files with gh aw compile --strict

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@YuliiaKovalova
Copy link
Copy Markdown
Member Author

the output example: YuliiaKovalova/dotnet#3 (comment)

@YuliiaKovalova YuliiaKovalova requested a review from baronfel May 21, 2026 18:35
…ogMcp

The dotnet-tools feed package that backs the `binlog-mcp` CLI has been
renamed from `AITools.BinlogMcp` to `Microsoft.AITools.BinlogMcp` (now
under the canonical `Microsoft.*` namespace). The CLI command exposed
by the tool (`binlog-mcp`) is unchanged, so only the
`dotnet tool install --global …` line and the package version need to
move.

  * Bump `BINLOG_MCP_VERSION` from `1.0.0-preview.26268.3` to
    `1.0.0-preview.26272.1` (the first version published under the new
    package id).
  * Update `dotnet tool install --global` invocations in both
    `build-failure-analysis.md` and `build-failure-analysis-command.md`.
  * Update the AzDO feed permalink in the agent's comment-footer
    template to point at the new package.
  * Regenerate `.lock.yml` files via `gh aw compile --strict`.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants