Add AI-powered build failure analysis agentic workflow #54401
Open
YuliiaKovalova wants to merge 5 commits into
Adds a gh-aw (GitHub Agentic Workflows) workflow that runs on every PR:
when ./build.sh --binaryLog fails, an AI agent reads the binlog (via the
binlog-mcp dotnet global tool), groups errors by root cause, and posts a
single PR comment plus inline `suggestion` blocks tied to the diff.
New files:
* .github/agents/build-failure-analyst.agent.md
Repo-tailored agent prompt covering SDK-specific concerns (CLI host
+ sub-commands under src/Cli, generated manpages, template-engine
boundary in dotnet/templating, local toolchain at ./.dotnet/, no
hand-edits to .xlf files).
* .github/workflows/build-failure-analysis.md (+ .lock.yml)
Pull-request trigger; runs ./build.sh --binaryLog with continue-on-
error, installs AITools.BinlogMcp + NuGet.Mcp.Server, dumps the
binlog to /tmp/binlog-data/*.json via the DumpBinlog helper, and
delegates to the analyst agent. Advisory only -- does not gate.
Timeout bumped to 60 min because SDK builds are heavier.
* .github/workflows/build-failure-analysis-command.md (+ .lock.yml)
/analyze-build-failure slash command for re-running the analysis
after force-pushes or comment dismissals.
* .github/workflows/shared/build-failure-analysis-shared.md
Shared delegation body imported by both workflows; launches the
agent as a background task and noops immediately.
* .github/workflows/scripts/DumpBinlog/
Standalone C# console app (net9.0, ModelContextProtocol 1.3.0) that
speaks MCP stdio to binlog-mcp and writes overview/errors/warnings
JSON. Used as a pre-agent step because the gh-aw MCP gateway does
not support non-containerized stdio MCP servers.
Modified:
* .github/aw/actions-lock.json
Minor bump of gh-aw-actions/setup (v0.71.5 -> v0.74.8) -- side
effect of the newer gh-aw CLI compiling the new workflows.
The workflow is fork-PR-safe (forks: [] skips them) and restricted to
admin/maintainer/write roles for manual reruns. Inline suggestion comments
are capped at 10 per run; the summary comment uses hide-older-comments to
collapse prior runs on update.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Adds a GitHub Agentic Workflows (gh-aw) automation that reruns ./build.sh --binaryLog on PR updates (and via a maintainer slash-command), extracts structured error data from the produced .binlog, and delegates to a repo-tailored analyst agent to post a single summary comment plus up to 10 inline suggestion comments.
Changes:
- Introduces PR-triggered and slash-command-triggered gh-aw workflows that run the SDK build, dump binlog data to JSON, and invoke an analysis agent.
- Adds a standalone `DumpBinlog` C# console app that speaks MCP over stdio to `binlog-mcp` and writes overview/errors/warnings JSON files.
- Adds a `build-failure-analyst` agent prompt customized for dotnet/sdk conventions and constraints.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| .github/workflows/shared/build-failure-analysis-shared.md | Shared gh-aw prompt body that delegates to the build failure analyst agent. |
| .github/workflows/scripts/DumpBinlog/Program.cs | MCP stdio client that calls binlog_* tools and writes JSON outputs for the agent to consume. |
| .github/workflows/scripts/DumpBinlog/DumpBinlog.csproj | Project file for the standalone DumpBinlog console tool (package + TFM). |
| .github/workflows/build-failure-analysis.md | PR-triggered workflow: runs build, installs tools, dumps binlog JSON, exports agent context, and delegates. |
| .github/workflows/build-failure-analysis.lock.yml | Generated compiled workflow lockfile for the PR-triggered workflow. |
| .github/workflows/build-failure-analysis-command.md | Slash-command workflow: reruns build + binlog dump and delegates analysis. |
| .github/workflows/build-failure-analysis-command.lock.yml | Generated compiled workflow lockfile for the slash-command workflow. |
| .github/aw/actions-lock.json | Updates gh-aw setup action pin (v0.71.5 → v0.74.8). |
| .github/agents/build-failure-analyst.agent.md | Repo-specific agent instructions for clustering failures and posting summary + inline suggestions. |
Comments suppressed due to low confidence (1)
.github/agents/build-failure-analyst.agent.md:183
- Same issue here: `${{ github.server_url }}/${{ github.repository }}/${{ github.run_id }}` is Actions expression syntax and won’t resolve when the agent renders the summary comment. Recommend switching these templates to use environment variables (or have the calling workflow pass the fully-formed run URL/repo URL into `GH_AW_*` variables).
<sub>🤖 Generated by the [Build Failure Analysis workflow](${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}) using <a href="https://dev.azure.com/dnceng/public/_artifacts/feed/dotnet-tools/NuGet/AITools.BinlogMcp">binlog-mcp</a> · commit ${GH_AW_PR_HEAD_SHA}</sub>
Build links to source using ${{ github.server_url }}/${{ github.repository }}/blob/${GH_AW_PR_HEAD_SHA}/<relative-path>#L<line>.
</details>
* Agent prompt: fix unclosed triple-backtick fence in frontmatter description
-- use inline `suggestion` blocks instead, so the prompt doesn't start a
stray markdown code fence.
* Agent prompt: replace GitHub Actions expression syntax (github.* templates)
with runtime env vars (GITHUB_SERVER_URL, GITHUB_REPOSITORY, GITHUB_RUN_ID)
-- Actions expressions are not interpolated inside an agent prompt and
would have been posted literally in the summary comment.
* Agent prompt: fix tool-name mismatch -- the example said
nuget_fix_vulnerable_packages but the documented tool is
fix_vulnerable_packages. Aligned the example with the actual tool name.
* PR-trigger workflow + shared body: same fence-closing fix in the
description / shared prompt body so importers don't start a stray
markdown code fence.
* Command (slash-command) workflow: add NUGET_MCP_VERSION env, an
"Install NuGet MCP Server" step, and NuGet.Mcp.Server in the bash tool
allowlist -- previously the command workflow couldn't run the NuGet
remediation path the analyst prompt asks for, so /analyze-build-failure
reruns were silently degraded vs the PR-trigger flow.
* Both workflows: resolve the binlog path to an absolute path in
find-binlog (via realpath) so GH_AW_BINLOG_PATH is absolute as the
agent prompt expects. Drop the now-redundant GITHUB_WORKSPACE prefix
when invoking DumpBinlog.
* DumpBinlog: in the fatal outer catch, write placeholder binlog-*.json
files with an { "error": ... } payload so downstream `cat` steps and
the agent always have something structured to read, even when the MCP
client cannot be constructed.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
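The placeholder behaviour described in the last bullet can be sketched in Python. This is an illustrative analog, not the C# code in `DumpBinlog/Program.cs`: the three output file names and the `fetch` callable (a stand-in for the MCP client calls) are assumptions, since the commit message only says "binlog-*.json".

```python
import json
from pathlib import Path

# Assumed output names; the commit message only says "binlog-*.json".
OUTPUT_NAMES = ("binlog-overview.json", "binlog-errors.json", "binlog-warnings.json")


def write_error_stubs(out_dir: Path, reason: str) -> None:
    """Write an {"error": ...} placeholder to every expected output file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    payload = json.dumps({"error": f"DumpBinlog fatal: {reason}"})
    for name in OUTPUT_NAMES:
        (out_dir / name).write_text(payload, encoding="utf-8")


def dump_binlog(out_dir: Path, fetch) -> bool:
    """Write real payloads when fetch() succeeds, stubs when it raises.

    fetch is a hypothetical callable standing in for the MCP client calls;
    it returns a dict mapping each output name to a JSON-serializable value.
    """
    try:
        data = fetch()
    except Exception as exc:  # outer catch: any fatal client failure
        write_error_stubs(out_dir, str(exc))
        return False
    out_dir.mkdir(parents=True, exist_ok=True)
    for name in OUTPUT_NAMES:
        (out_dir / name).write_text(json.dumps(data.get(name, {})), encoding="utf-8")
    return True
```

Either way, a later step that reads the three files finds valid JSON, so the failure is reported rather than hidden behind missing files.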
Addresses Dimension 17 (File I/O & Path Handling) NIT from the Expert
MSBuild Review. Previously, an empty `args[0]` would flow through
`Path.GetFullPath("")` (which resolves to the current working directory),
then through `File.Exists(<cwd>)` (false because cwd is a directory),
producing a confusing "Binlog not found: /path/to/cwd" error.
Add an explicit `string.IsNullOrWhiteSpace` check up front so the failure
mode is obvious. Applied to both repos symmetrically.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
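A rough Python analog of the up-front argument check, for illustration only; the function name is hypothetical and the real check lives in the C# helper. In Python, `os.path.abspath("")` returns the current working directory, so without the guard the "not found" message would name the working directory instead of flagging the empty argument.

```python
import os


def resolve_binlog_arg(arg):
    """Validate a binlog path argument and return it as an absolute path."""
    # Reject empty or whitespace-only input before any path resolution.
    if arg is None or not arg.strip():
        raise ValueError("binlog path argument is empty")
    full = os.path.abspath(arg)
    if not os.path.isfile(full):
        raise FileNotFoundError(f"Binlog not found: {full}")
    return full
```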
YuliiaKovalova added a commit to microsoft/testfx that referenced this pull request on May 21, 2026
…vers
The previous CI failure at "Start MCP Gateway" (run 26238445223,
job 77219093390) was caused by gh-aw emitting malformed JSON for a
`mcp-servers.nuget: { command: ..., args: [...] }` (process / stdio MCP
server) declaration: the generated config contained `"nuget": {` with
no body and no closing brace, immediately followed by `"safeoutputs":`,
which crashed the gateway with "Expected ',' or '}' after property
value in JSON at position 1001 (line 42 column 1)".
This change rewires NuGet MCP Server the same way `dotnet/sdk` and
`dotnet/msbuild` use it:
* `NuGet.Mcp.Server` is installed as a dotnet global tool in a
pre-agent step (already present here).
* It is listed in the workflow's `bash:` allowlist so the agent can
invoke `NuGet.Mcp.Server` directly via the shell tool.
* The MCP gateway never sees a `nuget` server entry, so the
malformed-JSON path that crashes the gateway is bypassed entirely.
Verified: the regenerated `*.lock.yml` MCP config now contains only the
`github` and `safeoutputs` server entries and parses as valid JSON.
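The failure and the verification step can be reproduced in miniature. The sketch below uses two reduced, hypothetical config strings (the real generated config is not shown here, and the `mcpServers` key name is an assumption) to show a strict JSON parser rejecting an entry that opens `"nuget": {` without a matching close, and accepting the regenerated shape.

```python
import json

# Hypothetical, reduced configs; "mcpServers" is an assumed key name.
BROKEN = '{"mcpServers": {"github": {}, "nuget": {\n"safeoutputs": {}}}'
FIXED = '{"mcpServers": {"github": {}, "safeoutputs": {}}}'


def gateway_servers(config_text):
    """Return the sorted server names, or raise ValueError on malformed JSON."""
    try:
        config = json.loads(config_text)
    except json.JSONDecodeError as exc:
        raise ValueError(
            f"malformed gateway config at line {exc.lineno}, "
            f"column {exc.colno}: {exc.msg}"
        ) from exc
    return sorted(config["mcpServers"])
```

Calling `gateway_servers(FIXED)` yields the two expected server names, while `gateway_servers(BROKEN)` raises because the outer object is never closed.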
Also applies the same Copilot-review fixes that landed in
`dotnet/sdk#54401` and `dotnet/msbuild#13835`:
* Triple-backtick fence inside YAML `description:` blocks rewritten
as inline backtick spans so the description is not interpreted as
an unterminated code fence.
* `${{ github.* }}` expressions inside the agent prompt body
rewritten as runtime env vars (`${GITHUB_SERVER_URL}`,
`${GITHUB_REPOSITORY}`, `${GITHUB_RUN_ID}`). The agent prompt body
  is passed verbatim to the LLM and is not YAML, so Actions-style
  `${{ ... }}` expressions stay literal.
* Agent description corrected to point at the C# `DumpBinlog`
helper instead of the removed `dump-binlog.js`.
* `Locate binlog` step now resolves the binlog path to absolute via
`realpath` so downstream consumers can treat `GH_AW_BINLOG_PATH`
as absolute, and the `Dump binlog as JSON` step drops the redundant
`$GITHUB_WORKSPACE/` prefix.
* Bumped the `dotnet run` timeout for `DumpBinlog` from 120s to 180s
to match the sdk/msbuild workflows (large binlogs in those repos
needed the extra headroom).
* `DumpBinlog/Program.cs` rejects an empty-string binlog argument
before calling `Path.GetFullPath`, and if `McpClient.CreateAsync`
throws it now writes a stub `{ "error": "DumpBinlog fatal: …" }`
payload to each of the three output JSON files so the agent's
`cat` step always has structured input to read.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
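The `realpath` change in the `Locate binlog` step amounts to resolving whatever path the search produced before exporting it. A minimal Python sketch of that idea follows; the recursive search and the function name are assumptions, since the step's actual search logic is not shown in this conversation.

```python
from pathlib import Path


def locate_binlog(search_root):
    """Return the absolute path of the first *.binlog under search_root, or None."""
    matches = sorted(Path(search_root).rglob("*.binlog"))
    if not matches:
        return None
    # resolve() makes the path absolute and follows symlinks, like realpath.
    return str(matches[0].resolve())
```

Because the returned value is already absolute, a consumer no longer needs to prepend a workspace prefix.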
YuliiaKovalova added a commit to YuliiaKovalova/dotnet that referenced this pull request on May 21, 2026
The same Copilot-review feedback that was addressed on the equivalent
PRs in `dotnet/sdk` (#54401) and `dotnet/msbuild` (#13835) also applies
to the VMR version of this workflow. Bringing those three fixes here:
* `build-failure-analyst.agent.md` — fix stale reference to the
removed `dump-binlog.js` helper. The C# `DumpBinlog` program has
replaced it; update the agent description so the agent looks for
the right artifact.
* `build-failure-analyst.agent.md` — replace `${{ github.* }}`
expressions in the agent prompt body with runtime env vars
(`${GITHUB_SERVER_URL}`, `${GITHUB_REPOSITORY}`, `${GITHUB_RUN_ID}`,
`${GH_AW_PR_HEAD_SHA}`). The agent prompt body is passed verbatim
to the LLM and is *not* YAML, so `${{ ... }}` expressions stay
literal in the rendered comment instead of being substituted at
runtime.
* `DumpBinlog/Program.cs` — reject an empty-string binlog argument
explicitly before calling `Path.GetFullPath` (a NIT from the
msbuild Expert MSBuild Review), and, if `McpClient.CreateAsync`
throws (e.g., `binlog-mcp` is not on PATH), write a stub
`{ "error": "DumpBinlog fatal: …" }` payload to each of the three
expected output JSON files so the agent's `cat` step always has
structured input to read. Without this, a fatal MCP-client failure
silently produced no files at all and the agent could not even
report the failure.
The lock file (`build-failure-analysis.lock.yml`) is regenerated by
`gh aw compile --strict`.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
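The difference between the two placeholder styles can be illustrated with Python's `string.Template`, used here only as a stand-in for runtime environment-variable substitution. How the agent actually expands the variables is not described in this conversation, and the footer text and sample values are invented for the example.

```python
from string import Template

# Footer written with runtime-style placeholders (the fixed form).
FOOTER = (
    "Generated by ${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}"
    "/actions/runs/${GITHUB_RUN_ID} at commit ${GH_AW_PR_HEAD_SHA}"
)
# Footer written with Actions expression syntax (the form that was replaced).
LEGACY = "${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"


def render(template_text, env):
    """Substitute ${NAME} placeholders from env; leave anything else untouched."""
    return Template(template_text).safe_substitute(env)


# Sample values, invented for illustration.
SAMPLE_ENV = {
    "GITHUB_SERVER_URL": "https://github.com",
    "GITHUB_REPOSITORY": "dotnet/sdk",
    "GITHUB_RUN_ID": "123",
    "GH_AW_PR_HEAD_SHA": "abc123",
}
```

With `SAMPLE_ENV`, `render(FOOTER, SAMPLE_ENV)` produces a complete run URL, whereas `render(LEGACY, SAMPLE_ENV)` returns the text unchanged, which is the literal output the review comment warned about.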
…otnet
- Remove NuGet.Mcp.Server from bash tools list (causes empty
'nuget': {} block in MCP Gateway JSON, crashing the gateway)
- Add 'dotnet' to bash allowlist so agent can invoke global tools
- Update agent to use 'dotnet NuGet.Mcp.Server' instead of stdin JSON
- Recompile both lock files with gh aw compile --strict
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The output example: YuliiaKovalova/dotnet#3 (comment)
…ogMcp
The dotnet-tools feed package that backs the `binlog-mcp` CLI has been
renamed from `AITools.BinlogMcp` to `Microsoft.AITools.BinlogMcp` (now
under the canonical `Microsoft.*` namespace). The CLI command exposed
by the tool (`binlog-mcp`) is unchanged, so only the
`dotnet tool install --global …` line and the package version need to
move.
* Bump `BINLOG_MCP_VERSION` from `1.0.0-preview.26268.3` to
`1.0.0-preview.26272.1` (the first version published under the new
package id).
* Update `dotnet tool install --global` invocations in both
`build-failure-analysis.md` and `build-failure-analysis-command.md`.
* Update the AzDO feed permalink in the agent's comment-footer
template to point at the new package.
* Regenerate `.lock.yml` files via `gh aw compile --strict`.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Adds a gh-aw (GitHub Agentic Workflows) workflow that runs on every PR: when `./build.sh --binaryLog` fails, an AI agent reads the binlog and inspects NuGet package metadata through two MCP servers --
- `binlog-mcp` -- structured access to the MSBuild binary log (errors, warnings, target timeline, project graph).
- `NuGet.Mcp.Server` -- package version / dependency / vulnerability lookup against nuget.org, used to disambiguate package-related failures (NU1605 downgrades, MSB3277 conflicts, runtime/aspire/templating pin drift across the SDK).
-- groups errors by root cause, and posts:
- a single PR comment;
- inline `suggestion` blocks tied to the diff for changes the reviewer can accept with one click.
The workflow is advisory only -- it never gates the merge, and is fork-PR-safe (`forks: []` skips outside-contributor PRs).
What's in the PR
- `.github/agents/build-failure-analyst.agent.md` -- repo-tailored agent prompt covering SDK-specific concerns: CLI host + sub-commands under `src/Cli`, generated manpages (`documentation/manpages/sdk`), the boundary with `dotnet/templating` for template-engine code, the local toolchain at `./.dotnet/`, and the rule that `.xlf` files are auto-generated from `.resx` and must not be hand-edited.
- `.github/workflows/build-failure-analysis.md` (+ generated `.lock.yml`) -- pull-request trigger on `[main, release/**]`; runs the build with `continue-on-error`, installs the two MCP servers (`Microsoft.AITools.BinlogMcp` + `NuGet.Mcp.Server`) as `dotnet tool`s, dumps the binlog to JSON via the `DumpBinlog` helper, and delegates to the analyst agent. `NuGet.Mcp.Server` is registered with gh-aw as a long-running MCP service the agent can call during analysis. Timeout 60 min (SDK builds are heavier than typical).
- `.github/workflows/build-failure-analysis-command.md` (+ `.lock.yml`) -- `/analyze-build-failure` slash command for re-running the analysis after force-pushes. Restricted to `admin` / `maintainer` / `write` roles.
- `.github/workflows/shared/build-failure-analysis-shared.md` -- shared delegation body imported by both workflows.
- `.github/workflows/scripts/DumpBinlog/` -- standalone C# console app (`net9.0`, `ModelContextProtocol` 1.3.0) that speaks MCP stdio to `binlog-mcp` and writes `overview.json` / `errors.json` / `warnings.json`. Used as a pre-agent step because the gh-aw MCP gateway does not support non-containerized stdio MCP servers.
- `.github/aw/actions-lock.json` -- minor bump of `gh-aw-actions/setup` (v0.71.5 -> v0.74.8), side effect of the newer `gh aw` CLI used to compile the new workflows.
How it runs
- […] `./build.sh --binaryLog -c Release`.
- […] `.binlog` is produced (Arcade default location `artifacts/log/<config>/Build.binlog`), `DumpBinlog` extracts errors/warnings/overview to `/tmp/binlog-data/*.json`.
- […] `NuGet.Mcp.Server` for the actual published versions, dependency graphs, and known advisories.
- […] (`hide-older-comments`); up to 10 inline suggestion comments are added.
Safety properties
- `forks: []` -> agent never runs against an outside-contributor PR (no token leakage).
- `timeout_minutes: 60` -> hard cap.
- `permissions: contents: read, pull-requests: write` -> minimum needed.
- […] `[defaults, dotnet]`.
- […] `[admin, maintainer, write]`.
Why this is useful for the SDK
The repo's most common build failures cluster into a few categories where an LLM with the binlog can do meaningful triage:
- […] `src/Cli` and tests.
- […] `CoreCompile` (generated manpages, generated resource accessors).
- `.xlf` mismatches when a `.resx` was edited but localization wasn't synced.
- […] (`net472` vs `net10.0`) in shared helper projects.
Grouping these by root cause and posting an actionable suggestion saves a round trip for the contributor.
Coexistence with existing workflows
This sits alongside the existing agentic workflows in `.github/workflows/` (`add-tactics-template-on-comment`, `fix-completions-on-comment`, `detect-netsdk-diagnostics`, etc.). It uses the same `.github/aw/actions-lock.json` registry and the same network allowlist conventions.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
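To make the "group by root cause, cap inline suggestions" idea concrete, here is a deliberately simplified Python sketch. It clusters error records by diagnostic code only, which is a much cruder heuristic than whatever the analyst agent does, and the record fields (`code`, `file`) are hypothetical because the JSON schema written by `DumpBinlog` is not shown in this conversation.

```python
from collections import defaultdict

MAX_INLINE_SUGGESTIONS = 10  # cap stated in the PR description


def group_by_code(errors):
    """Cluster error records by diagnostic code, largest cluster first."""
    groups = defaultdict(list)
    for err in errors:
        # "code" is an assumed field name; records without it share one bucket.
        groups[err.get("code", "UNKNOWN")].append(err)
    # Sort by descending cluster size, then by code for a stable order.
    return sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))


def cap_suggestions(suggestions, limit=MAX_INLINE_SUGGESTIONS):
    """Keep at most `limit` inline suggestions per run."""
    return list(suggestions)[:limit]
```

Ordering clusters by size puts the failure that explains the most errors at the top of the summary, and the cap keeps a noisy build from flooding the diff with comments.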