docs(ai-agents): design doc for context-aware next-step guidance#8057
docs(ai-agents): design doc for context-aware next-step guidance#8057antriksh30 wants to merge 4 commits intomainfrom
Conversation
2bb0e25 to
9afe882
Compare
Initial draft of the design for issue #7975 context-aware Next: guidance across azd ai agent commands plus a new `azd ai agent doctor` diagnostic. Scoped entirely to cli/azd/extensions/azure.ai.agents/. References core files for context but proposes no edits outside the extension. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
9afe882 to
2329299
Compare
There was a problem hiding this comment.
Pull request overview
Adds a design document proposing context-aware “Next:” guidance for successful azd ai agent commands and a new azd ai agent doctor diagnostic within the azure.ai.agents extension, including a proposed nextstep package, state-resolution model, and per-command decision trees.
Changes:
- Introduces an end-to-end architecture for resolving and printing state-aware next-step suggestions (including deploy-time artifact note embedding).
- Defines the proposed state model, command-specific resolution policy, output/TTY discipline, and OpenAPI-derived invoke payload extraction strategy.
- Outlines an MVP
doctorcommand with a phased checklist plan, plus testing strategy, risks, and alternatives.
Companion to azd-ai-agent-nextsteps.md. Covers the five live-Azure doctor checks deferred from the MVP: auth, Foundry reachability, model deployments, RBAC, and agent runtime status. Key decisions: - Opt-in via --remote flag (default stays local-only, fast, no auth) - Strict dependency matrix; failed prerequisites become explicit Skip results - Per-check timeouts (~10s); ~30s worst-case walltime for --remote - Read-only RBAC fix is surfaced as az CLI command, never auto-applied - Phased rollout: runner split, auth+reach, models+status, RBAC last Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
wbreza
left a comment
There was a problem hiding this comment.
🔍 Design Doc Review
Thanks for the thorough design, @antriksh30! Strong technical depth with excellent extension-boundary discipline. Below are findings from a multi-perspective review (architecture, PM, testing, documentation).
P0 — Must Fix Before Merge (5)
1. RBAC principal ID + subscription path leakage
📄 �zd-ai-agent-doctor-remote-checks.md — Check 10
Check 10's fix command templates --assignee and --scope into stdout. In CI, these end up in persistent build logs — a credential-adjacent leak. The design needs a security-aware output contract: redact in non-interactive mode, or route sensitive fix commands to --output json only.
2. Check 7 auth mechanism contradicts between documents
📄 Both docs — Check 7
�zd-ai-agent-nextsteps.md says �z account show; �zd-ai-agent-doctor-remote-checks.md says �zidentity.NewAzureDeveloperCLICredential. These are different mechanisms with different failure modes. Suggest aligning on AzureDeveloperCLICredential since this is an �zd extension.
3. Parallel fan-out contradicts dependency matrix
📄 �zd-ai-agent-doctor-remote-checks.md — Risks table vs Dependency matrix
Risks table claims checks 8+10 can run concurrently, but the dependency matrix shows check 10 depends on check 8. Both can't be true. Either remove 8 as a dependency of 10 (RBAC calls ARM, not Foundry) or correct the risks table.
4. Security considerations section missing entirely
📄 �zd-ai-agent-doctor-remote-checks.md
A feature handling Azure credentials, UPN display, and CI output needs an explicit security/threat section covering: what sensitive data appears in output, behavior in CI/non-interactive contexts, and telemetry exclusions for PII fields.
5. �gent.yaml assumption contradicts issue #7975
📄 �zd-ai-agent-nextsteps.md
Issue #7975 states "assumes unified �zure.yaml" (agent.yaml gone), but the design references �gent.yaml throughout. Add an "Assumptions" section clarifying which config schema the design targets and what happens when unification lands.
P1 — Should Fix (17)
Expand P1 findings
| # | File | Finding |
|---|---|---|
| 6 | nextsteps | State model missing IsAuthenticated field — resolver can't distinguish auth states; implementer will hit this mid-build |
| 7 | remote-checks | Check 11 dependency on check 10 too strict — Reader-role users get agent status skipped unnecessarily; consider removing check 10 from 11's deps |
| 8 | remote-checks | Check 8 conflates auth failure with reachability — 403 could be wrong-tenant, not missing RBAC; update fail message to mention both possibilities |
| 9 | nextsteps | Stray trailing triple-backtick at end of file corrupts markdown rendering |
| 10 | nextsteps | Duplicate "Multi-line note pre-indentation" heading — copy-paste error, remove duplicate |
| 11 | nextsteps | Broken #alternatives-considered anchor — links to a section that doesn't exist; add the section or remove the link |
| 12 | nextsteps | Backward compatibility not addressed — new Next: stdout output could break scripts; add a "Backward Compatibility" section |
| 13 | remote-checks | No --output json for doctor --remote — CI consumers need machine-readable doctor results |
| 14 | both | No doctor exit codes specified — CI can't gate on doctor (0=pass, 1=fail, 2=skip) |
| 15 | both | No test strategy for TTY/pipe/JSON output routing — core correctness contract needs test coverage |
| 16 | both | No test for non-interactive skip path — isTerminal detection failure in CI = hanging process |
| 17 | remote-checks | Runner skip-cascade testing missing — dependency matrix logic is the most complex part |
| 18 | both | Doctor file placement inconsistency — nextsteps says doctor.go, remote-checks says doctor/ subpackage |
| 19 | nextsteps | No user personas — design can't distinguish new developer from CI platform engineer |
| 20 | nextsteps | No success metrics/KPIs — no way to measure feature effectiveness |
| 21 | both | No telemetry/observability plan — check results should be instrumented for iteration |
| 22 | remote-checks | Foundry API version not specified — "pin api-version" without naming which one |
P2 — Nice to Have (9)
Expand P2 findings
- Missing newline at EOF (remote-checks)
- Orphaned period at start of line in deploy decision tree
- Relative links to extension source paths need verification
- Trailing whitespace on multiple lines
- Check 1 "extension running" is tautological — consider renaming to "gRPC channel health"
- Adversarial/inconsistent State inputs not in test strategy
- Multi-agent cap of 5 is unjustified — consider data-backed threshold or making configurable
- OpenAPI $ref resolution not mentioned in extraction logic
- Check 9 model version mismatch deferred — could be low-cost Warn since data is already fetched
Strengths
- Extension-boundary discipline is excellent — no modifications outside the extension
- Decision trees are tabular and testable
- Artifact-note path for deploy is pragmatic and well-justified
- Phased implementation with independent shippability is the right call
- Risk tables are realistic, not boilerplate
- Cross-references between companion docs are symmetric and correct
jongio
left a comment
There was a problem hiding this comment.
@wbreza already covered this thoroughly - adding a few things that review didn't flag.
Artifact integration mechanism is underspecified (the most important gap)
The doc repeatedly says "attach Next: to the artifact's note field" and "populate the artifact's note field" - but azdext.Artifact has no Note field. The proto defines: Kind, Location, LocationKind, and Metadata (map). Today's deploy artifacts are ARTIFACT_KIND_ENDPOINT with Metadata["label"] rendered as - label: url.
For the Next: block to work through this channel, the design needs to specify:
- Which
ArtifactKindcarries freeform text? A new enum value likeARTIFACT_KIND_NOTE? Or reusing an existing one? - How does core's
MessageUxItemrenderer know to display multi-line text instead of a- label: urlline? - If using
Metadata["note"]- does core's rendering code already handle that key specially, or does it need changes (which contradicts the "no edits outside the extension" scope)?
This is the one integration point between the extension and core that isn't fully self-contained - and the design doc's scope claim ("no files outside the extension are modified") depends on getting this right.
State model type inconsistency
HasUnresolvedInfraVars is bool but HasUnresolvedManualVars is []string. The init resolver needs manual var names to emit azd env set <KEY> <value>. But if infra vars are unresolved, knowing which Bicep outputs are missing would let the guidance be more specific than just "run azd provision".
Deploy payload derivation gap
The run decision tree has OpenAPI extraction logic for deriving the invoke payload. The deploy tree just says '' literally. Should the post-deploy Next: block attempt OpenAPI discovery against the remote endpoint, or reference the README? The run case and deploy case have asymmetric payload handling without explaining why.
Synthesized across 4 model critiques (claude-opus-4.7, claude-sonnet-4.6, gpt-5.5, gpt-5.4). Doc-only changes; no implementation churn. azd-ai-agent-nextsteps.md - Status: anchor artifact mechanism on `Metadata["note"]` of an `ArtifactKindEndpoint` artifact rendered by `pkg/project/artifact.go` (no core edits). - Drop broken `[Alternatives](#alternatives-considered)` link; rephrase Non-goals to keep the "no core edits" point explicit. - Add Assumptions section: design assumes unified `azure.yaml` baseline per #7975; `agent.yaml` references are placeholders for the equivalent unified service-stanza fields. Pin auth to `azidentity.NewAzureDeveloperCLICredential` (matches `agent_context.go`). - Component layout: align `doctor.go` to `doctor/` subpackage matching the remote-checks doc. - State Model: rename `HasUnresolvedInfraVars bool` to `MissingInfraVars []string`; mirror `MissingManualVars`. Add tri-state `IsAuthenticated` (Unknown/Authed/Unauthed) populated only via `WithAuthProbe` opt-in; default `Unknown` means "skip auth-conditional advice", not "tell user to log in". Note `HasOpenAPI`/`OpenAPIPayload` only populated by `run`/`doctor --remote` paths. - Decision Tree: rename `init` condition to `len(MissingInfraVars) > 0`; drop `monitor --follow` secondary from `invoke --local` success (monitor is hosted-only — verified in `monitor.go`); deploy resolver uses cached spec only, no live probe at deploy time (agent may not be reachable yet). - Output discipline: name `isTerminal` source of truth (`helpers.go` wrapper around `golang.org/x/term.IsTerminal`). Reconcile JSON-mode contradiction: `--output json` on success-path commands suppresses the human `Next:` block; `doctor --output json` is a separate structured emit with explicit `redacted` field. Remove duplicate "Multi-line note pre-indentation" heading. - Hook Points: clarify deploy-row mechanism cites the existing endpoint-kind artifact emitted by `service_target_agent.go` and the renderer at `pkg/project/artifact.go` (`ToString`); no new artifact kind. - OpenAPI Discovery: note silent-mode requirement for `fetchOpenAPISpec` when called by the resolver path (currently prints "OpenAPI spec saved to..." at `helpers.go:373`). Mark `$ref` resolution as out of scope — fall back to `<payload>` literal. - Doctor checks: rename check 1 to "extension gRPC channel healthy"; point check 7 at `azidentity.NewAzureDeveloperCLICredential`; clarify check 6 placeholder for unified-schema service stanza. - Add `Exit codes & JSON output` subsection: precedence rules (any fail -> 1, skip-only -> 2, otherwise 0). JSON envelope sketch with `schemaVersion`, `remote`, `redacted`, `checks[]`. - Testing: TTY/pipe/JSON snapshot routing; non-interactive auth-skip; `$ref` -> `""` extractor; cached-spec vs absent fallback; explicit skip-cascade test over the dependency matrix; exit-code precedence + JSON envelope shape. - Add Backward Compatibility section. - Add Security & PII / Telemetry section: redaction rules; `isTerminal` source-of-truth; logging policy; brief deferred-telemetry sketch. - Add Deferred follow-ups: personas/KPIs/full-telemetry-plan rationale. - Risks: replace cap-of-5 mitigation with a wording that doesn't pin a magic number — fall back to a single `azd ai agent show` line if multi-agent walls become a real problem. azd-ai-agent-doctor-remote-checks.md - Dependency matrix: drop check 8 from check 10 deps (RBAC reads ARM, not Foundry data plane); drop check 10 from check 11 deps (agents-list is Reader-level). - Check 8: pin api-version to `DefaultAgentAPIVersion` constant in `internal/cmd/agent_context.go` (currently `2025-11-15-preview`); add 401 case (token expired); broaden 403 wording to "wrong tenant or insufficient RBAC". - Check 10: add redaction rule for non-interactive output; flag in JSON envelope. - Risks: reconcile parallel-fan-out claim post matrix simplification; add identifier-leakage row pointing at the redaction rule. - Testing: add explicit skip-cascade test over the dependency matrix; redaction snapshot tests for interactive vs. non-interactive RBAC. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Thanks @wbreza, ran a multi-model critique on the fix plan first then applied. Status of your list (commit 4358e68): P0 (5/5): all addressed — RBAC redaction (Check 10) ✅ · auth alignment to P1 (14/17 + 3 explicitly deferred): P2 (4/9): Check 1 → "gRPC channel health" ✅ · multi-agent cap-of-5 replaced ✅ · OpenAPI Plus jongio's three: anchored Status block on |
📋 Prioritization NoteThanks for the contribution! The linked issue isn't in the current milestone yet. |
jongio
left a comment
There was a problem hiding this comment.
Addresses my previous feedback. The new commit clarifies all three gaps I flagged:
- Artifact mechanism - now explicitly specifies
Metadata["note"]onArtifactKindEndpoint, referencespkg/project/artifact.goToString(~line 128). Verified this path exists and works as described. - State model asymmetry - renamed to
MissingInfraVars []stringandMissingManualVars []string(both[]stringnow). - Deploy payload - added "No live OpenAPI probe at deploy time" paragraph with the fallback chain (cached spec -> README -> generic literal).
Also good additions: AuthState enum with AuthUnknown default (avoids login noise), WithAuthProbe opt-in, exit code table, JSON envelope schema, redaction rules, security section, backward compatibility section, and the $ref out-of-scope note.
One minor observation: the isTerminal helper is referenced as living in internal/cmd/helpers.go - worth confirming the exact function name in a code comment so implementers don't have to grep for it.
|
|
||
| **Draft — for design review.** | ||
| Tracking issue: [#7975](https://github.com/Azure/azure-dev/issues/7975) — "Context-aware next-step guidance and diagnostics". | ||
| Scope: the `azure.ai.agents` extension only. **All code lives under `cli/azd/extensions/azure.ai.agents/`. No files outside the extension are modified.** The deploy hook returns its Next: block via the standard extension SDK return value (`*azdext.Artifact`) — specifically the `Metadata["note"]` key on an `ArtifactKindEndpoint` artifact, which `pkg/project/artifact.go` already renders as a continuation line under the artifact bullet (gated on `Kind == ArtifactKindEndpoint`). The extension already emits endpoint-kind artifacts at `internal/project/service_target_agent.go`; we populate the `note` metadata key on the same artifacts. |
There was a problem hiding this comment.
@rajeshkamal5050 Is this a good candidate and time for us to do azure.ai.projects extension and put this as a shared framework for next steps suggestions? Eventually this looks like something that could be provided by azd core, but for now, we can at least share it across azd ai * extensions.
There was a problem hiding this comment.
Discussed in the large chat, I think for now let's leave it as part of the agent extension focused on the agent commands, and we can expand later.
- Flip flag model from `--remote` opt-in to `--local-only` opt-out. Default `azd ai agent doctor` now runs all checks (1–12); `--local-only` is the CI / offline opt-out. Update Goal, Execution Model, Output Format example, Performance Budget, Implementation Phases, and the Appendix accordingly. - Add Terminology section defining MVP explicitly as "local checks 1–6 ship first" — chronological, not a permanent capability cap. - Soften the local-only framing in Background: clarify that remote checks intentionally make network calls under user credentials, which is the whole point of this doc. - Add Check 12: Agent identity role assignments. Lists role assignments on the agent's managed identity at project / account / resource-group scopes, with a new INFO glyph for the informational case. Distinct from check 10 (user RBAC) — different principal, different fix command, different preconditions. Rename check 10 to "User RBAC" for symmetry. - Add a tracking note on Check 9 referencing #7962 (`agent.yaml` → `azure.yaml` unification, syncing with @trangevi). - Rewrite the first Risks row: replace the "users in CI miss `--remote`" scenario (no longer applicable) with a precise non-interactive auth failure path that surfaces a clear "re-run with --local-only" message. - Update companion `azd-ai-agent-nextsteps.md` for cross-reference consistency: doctor table extended to checks 7–12; flag references flipped (`--local-only` is the new opt-in flag, full sweep is default); comments on `WithAuthProbe` updated to match. Resolves comments #1–7 from @therealjohn on PR #8057. Comment #8 (scope: move to `azure.ai.projects`?) is deferred for @rajeshkamal5050. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…odes, flags) Adds the user-facing surface called for in PR #8057's design doc so the doctor command reaches feature-parity with the MVP spec ahead of the remote-checks (7-12) follow-up: * --output json: emits a stable JSON envelope (schemaVersion 1.0) with per-check id, title, status, detail, fix, and durationMs. Suppresses the human Next: block so machine consumers see a flat schema. * --local-only: surface to opt out of remote checks. No-op today (only checks 1-6 are implemented) but stable for forward compat. * --unredacted: interactive-only escape hatch for raw values in the JSON envelope. Always honored; auto-redacts when stdout is not a TTY. * Exit codes: 0 (pass+warn+skip mix with at least one non-skip), 1 (any fail), 2 (every check skipped). Wired via os.Exit so the host's error reporter doesn't add a stray 'Error:' line on top of the clean doctor report. * Stable per-check IDs (e.g. local.azd-cli, local.project-endpoint) for the JSON envelope. * Per-check duration capture via a small timed() wrapper. Tests: TestComputeDoctorExitCode (table-driven), TestWriteDoctorJSON envelope shape and remote flag, TestDoctorStatusString, TestDoctor_TimedStampsDuration, TestDoctorCommand_FlagsRegistered. All existing tests still green. Refs: #7975 Refs: #8057 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jongio
left a comment
There was a problem hiding this comment.
All three items from my earlier review are addressed. Latest commit (flag model flip to --local-only, check 12 for agent identity roles, terminology section) introduces no new issues. Approving.
trangevi
left a comment
There was a problem hiding this comment.
Looks good enough to proceed in my opinion. I'm sure other stuff might come up during implementation, but don't think we need to block on getting a perfect spec
|
|
||
| **Draft — for design review.** | ||
| Tracking issue: [#7975](https://github.com/Azure/azure-dev/issues/7975) — "Context-aware next-step guidance and diagnostics". | ||
| Scope: the `azure.ai.agents` extension only. **All code lives under `cli/azd/extensions/azure.ai.agents/`. No files outside the extension are modified.** The deploy hook returns its Next: block via the standard extension SDK return value (`*azdext.Artifact`) — specifically the `Metadata["note"]` key on an `ArtifactKindEndpoint` artifact, which `pkg/project/artifact.go` already renders as a continuation line under the artifact bullet (gated on `Kind == ArtifactKindEndpoint`). The extension already emits endpoint-kind artifacts at `internal/project/service_target_agent.go`; we populate the `note` metadata key on the same artifacts. |
There was a problem hiding this comment.
Discussed in the large chat, I think for now let's leave it as part of the agent extension focused on the agent commands, and we can expand later.
|
|
||
| ## Assumptions | ||
|
|
||
| This design assumes the unified-`azure.yaml` proposal in [#7975](https://github.com/Azure/azure-dev/issues/7975) has landed: a single `azure.yaml` is the canonical config; `agent.yaml` and `agent.manifest.yaml` no longer exist as separate files. References to `agent.yaml` in the doctor checks and decision trees below are placeholders for the equivalent fields in the unified `services.<name>` stanza. If the extension ships before unification, each `agent.yaml` reference should be read as the corresponding fields under `services.<name>` in `azure.yaml`. The `AssembleState` function abstracts the source split, so resolvers are insulated from the migration — only the state-assembly code needs to change. |
There was a problem hiding this comment.
We will not have a unified azure.yaml by //build, so for now this should proceed with the azure.yaml and agent.yaml (definition).
| | 3 | `azure.yaml` services, protocols, `config.env` refs | Structural truth | | ||
| | 4 | azd env vars | Resolution truth | | ||
|
|
||
| If the azd environment is missing or stale (a real scenario for brownfield users), the resolver should still produce useful output from layers 1–3. |
There was a problem hiding this comment.
How will "stale" be determined?
| | Scenario | Destination | | ||
| |---|---| | ||
| | Interactive TTY | `stdout` | | ||
| | Piped output / CI (detected via `isTerminal`) | `stderr` so it doesn't pollute parsable output | |
There was a problem hiding this comment.
I don't really like the idea of this showing up as an error, I feel like that would potentially confuse anyone looking at the output to determine what happened in case of a failure/might confuse an agent thinking that these are failures. If anything, wouldn't we want this to be part of the parseable output, so it helps guide what is next?
| | `init` | end of `runE` after success message | direct `nextstep.PrintNext(w, …)` | | ||
| | `run` | "agent is listening" callback | direct print before `Press Ctrl+C` line | | ||
| | `invoke --local` / `invoke` | end of `runE` on success | direct print | | ||
| | `show` | end of table render | direct print | |
There was a problem hiding this comment.
why not put everything at the end of the runE if possible? I understand some commands like run and deploy are unique, but most I think it would make sense to put in the same place for consistency?
Fixes #7975 initial draft design doc context-aware
Next:guidance after every successfulazd ai agentcommand, plus a newazd ai agent doctordiagnostic.Scope: entirely inside
cli/azd/extensions/azure.ai.agents/. References core files for context but proposes no edits outside the extension.Doc covers:
nextsteppackage + thin command callersStatemodel with layered sources (flags > runtime probes > azure.yaml > azd env)azdext.ArtifactSDK field)