Project: ManagedCode.MarkItDown
Stack: .NET 10, C# 14, xUnit, Spectre.Console, Azure/OpenAI integrations, GitHub Actions
Follows MCAF
Learn the user's habits, preferences, and working style. Extract rules from conversations, save to "## Rules to follow", and generate code according to the user's personal rules.
Update mechanism:
- Before any task, evaluate the latest user message.
- If a new permanent rule/correction/preference/process appears, update `AGENTS.md` first.
- Put new persistent rules under `## Rules to follow`.
- If no new permanent rule appears, do not update this file.
Extraction guidelines:
- Add rules when user states "never", "always", "remember", "the process is", "from now on".
- Treat strong frustration/repetition as high-priority permanent rules.
- Ignore one-off/temporary instructions (for this task only).
- build: `dotnet build MarkItDown.slnx`
- test: `dotnet test MarkItDown.slnx`
- format: `dotnet format MarkItDown.slnx`
- analyze: `dotnet build MarkItDown.slnx -p:RunAnalyzers=true`
- coverage: `dotnet test MarkItDown.slnx --collect:"XPlat Code Coverage"`
- Always start from the architecture map in `docs/Architecture/Overview.md`.
- Define in-scope/out-of-scope before implementation.
- Use only the minimum required context; do not scan the whole repo without need.
- If a task matches an existing skill, follow that skill workflow.
- Analyze current behavior first, then implement.
- Implement code and tests together.
- If `build` is separate from `test`, run `build` before `test`.
- Run the full test suite after making changes and share results.
- When executing tests, always include `ManualConversionDebugTests`; treat failures as blocking.
- Always run required commands yourself; do not ask the user to run them.
- When asked to adopt an external tutorial/framework, execute the tutorial checklist end-to-end and document the result in repo docs.
- All docs live in `docs/`.
- Global architecture entry point: `docs/Architecture/Overview.md`.
- Keep architecture docs navigational (diagrams + links), and push detailed behavior to `docs/Features/*` and decisions to `docs/ADR/*`.
- Single source of truth: link instead of duplicating large blocks across docs.
- Remove template placeholders (`TEMPLATE ONLY`, `TODO`, `...`) in real docs.
- Update feature docs when behavior changes.
- Update ADRs when architecture/contract decisions change.
- Keep Mermaid diagrams valid and renderable.
- Installer sync rule: if install-relevant assets change (`scripts/mcaf-install.sh`, `docs/templates/`, `skills/`), keep remote installer/docs consistent.
- Prefer integration/API/UI tests over isolated unit tests for behavior verification.
- Use real dependencies for internal systems; mocks are only for external third-party systems.
- Every behavior change must have meaningful automated test coverage.
- Do not delete/weaken tests to make CI green.
- Flaky tests are failures; fix root cause.
- Use coverage to find blind spots, not as a vanity metric.
- Integration/live tests must fail loudly on real provider/auth/network errors; never swallow with `catch { return; }`.
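
The fail-loudly rule can be illustrated with a minimal sketch. `LiveTestSketch` and both method names are hypothetical and exist only for illustration; the `callProvider` delegate stands in for any real provider, auth, or network call made by a live test.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper type, not part of the repository.
public static class LiveTestSketch
{
    // Anti-pattern named in the rule: any provider failure is swallowed,
    // so the calling test returns early and is reported as passing.
    public static async Task SwallowingAsync(Func<Task> callProvider)
    {
        try
        {
            await callProvider();
        }
        catch
        {
            return;
        }
    }

    // Preferred shape: no catch block, so provider/auth/network
    // exceptions propagate to the test runner and fail the test.
    public static async Task FailLoudlyAsync(Func<Task> callProvider)
    {
        await callProvider();
    }
}
```

With a delegate that throws, `SwallowingAsync` completes normally and hides the failure, while `FailLoudlyAsync` rethrows it to the caller.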
- Never introduce fallback logic that silently overrides user or config values; surface configuration errors instead of masking them in code.
- Keep `SegmentOptions.MaxParallelImageAnalysis` at `Math.Max(Environment.ProcessorCount * 4, 32)` and do not downscale it via runtime fallbacks.
- Treat non-positive `SegmentOptions.MaxParallelImageAnalysis` values as configuration errors: fail fast instead of defaulting to unlimited concurrency.
- Ensure document segments remain in source order with explicit numeric page/segment metadata; avoid relying on labels like "Page 1".
- When extracting images (or other artifacts), persist them to disk when a target path is supplied and record the file path in artifact metadata.
- Generate Markdown output from the ordered segment collection so it always reflects current segment content; avoid storing stale Markdown snapshots.
- Allow `ConvertAsync` (and related entry points) to accept caller-supplied options for AI/config overrides on a per-document basis.
- MIME handling: always use `ManagedCode.MimeTypes` for MIME constants, lookups, and validation logic.
- Treat this repository as a high-fidelity port of `microsoft-markitdown`: every fixture copied from upstream `tests/test_files/` must be referenced by .NET tests (positive conversion or explicit unsupported case).
- CSV parsing must use the `Sep` library; avoid Sylvan or other CSV parsers for new/updated code.
- Format integration tasks: never break the project or existing tests, and validate new format handling against real sample files.
- Test fixtures must be surfaced via generated `TestAssetCatalog`; add binaries under `TestFiles/` and use catalog constants in tests.
- YouTube converter work must include at least one live integration test using the real metadata provider (skip gracefully if upstream API unavailable).
- Media routing: if `StreamInfo` resolves to `audio/*` or `video/*` (uploaded media), do not route through `YouTubeUrlConverter`.
- For `video/*` inputs, do not use local audio-transcriber fallback; enforce Azure Video Indexer media-provider flow (upload, wait for `Processed`, then read transcript/index).
- Azure Video Indexer output quality: include rich video analysis in markdown (timings, speakers, sentiment/emotion signals, and topic/keyword/context summaries), not transcript-only text blocks.
- Azure Video Indexer fixes: when a working reference client exists in `diwo/`, mirror its proven auth/token + processing-state flow before introducing alternative logic.
- Never introduce test-only abstractions like `IAzureIntegrationSampleResolver` into core production library code.
- Image enrichment tasks: after OCR, send artifacts through shared `IChatClient` prompt constants; produce detailed visual descriptions, Mermaid/tables for diagrams, and MetaMD-compliant markdown.
- Image AI enrichment must reject missing MIME metadata.
- After AI image enrichment, strip legacy/fallback image comments so final markdown has one canonical image placeholder + description.
- Front matter titles must ignore metadata/image-description comments and use first real document text.
- Intelligence helper refactors must return explicit result objects rather than relying on hidden side effects.
- Image placeholders must use markdown image links when persisted files exist; use bold fallback only when no file exists.
- If AI image enrichment returns no insight, log and continue (soft failure).
- Converter selection diagnostics: when a converter throws in `ConvertAsync`, include converter name and detected mime/extension in user-visible failure details.
- Converter selection diagnostics: always emit an `Information` log entry with the selected converter name when conversion starts for a candidate input.
- Converter failure classification: if provider/converter chain fails with authentication/authorization (`401/403` or credential auth errors), surface `FileConversionException` with auth context instead of masking it as `UnsupportedFormatException`.
- Converter-routing tests: assert behavior (selected path/provider and typed exceptions), not brittle message include/exclude string checks.
- Project cleanup/refactor tasks must be executed step-by-step: audit converter groups (agent-assisted when available), apply incremental refactors, then run full regression tests.
- Telemetry changes must instrument both overall document duration and per-page duration with trace + metric coverage.
- For large converters, use partial classes and dedicated subfolders.
- Markdown hygiene: strip non-printable spaces (NBSP/ZWSP/etc.) and replace with ASCII spaces.
- Architecture revamps: prefer DI-first composition, per-request cloud model selection, and `System.IO.Pipelines`-compatible scheduling.
- DOCX processing changes should preserve pipeline parallelism and output ordering.
- URL APIs must expose `Uri` overloads in addition to string forms.
- Manual Azure config defaults: do not auto-populate `AzureIntegrationConfigDefaults` from environment variables.
- Azure Video Indexer config binding should use explicit settings objects/JSON values (including `ArmAccessToken`) and must not inject `AccountName`/resource identifiers from environment-variable fallbacks.
- AzureIntelligence integration tests: do not source ARM tokens from `diwo`; keep Video Indexer auth in explicit `HardcodedVideoIndexerOptions`.
- Azure Video Indexer live tests in this repo must use explicit `HardcodedVideoIndexerOptions` (including `ArmAccessToken`) and must not require Azure CLI / device login / `DefaultAzureCredential`.
- Azure Video Indexer validation must prove server-side indexing: extract `videoId` from transcript metadata and verify `/Videos/{videoId}/Index` reaches `state=Processed` with transcript entries.
- Never use `MemoryStream` for conversion paths; rely on file-based processing.
- Disk-first refactors: shared disk/workspace helpers go to reusable base classes, not nested per-converter helpers.
- Document pipeline changes must align with `docs/DocumentProcessingPipeline.md` and keep shared setup centralized.
- Manual conversion diagnostics must persist output to disk and keep MetaMD image description blocks.
- Multi-page tables must emit continuation comments and populate `table.pageStart`, `table.pageEnd`, `table.pageRange` metadata.
- PDF converters must honor `SegmentOptions.Pdf.TreatPagesAsImages` by rendering pages to PNG, running OCR/vision enrichment, and composing image+recognized-text segments.
- Persist conversion workspaces through `ManagedCode.Storage` with sanitized per-document folders and store extracted artifacts + final markdown there.
- Media upload integrations must use `ManagedCode.Storage.Core.IStorage` directly (factory/options flow); do not introduce custom URL-upload provider abstractions when `IStorage` is available.
- After feature/refactor work, delete orphaned/unused code files and stale abstractions immediately; do not leave dead code in the repository.
- In storage-related tests, use real `ManagedCode.Storage` implementations (for example `LocalStorage`) instead of custom storage stubs when feasible.
- Root path configurability: `MarkItDownPathResolver` must support configurable root via `MarkItDownOptions.RootPath` or `MarkItDownServiceBuilder.UseRootPath()`, with lock-guarded atomic initialization and conflict exceptions.
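
The two `MaxParallelImageAnalysis` rules above (a fixed default, and fail-fast on non-positive values) could be sketched roughly as follows. `SegmentOptionsSketch` is a hypothetical stand-in for the real `SegmentOptions` type, and both the `Validate` method and the choice of `ArgumentOutOfRangeException` are assumptions for illustration, not the repository's actual API.

```csharp
using System;

// Hypothetical stand-in for SegmentOptions; only the property name and
// the default formula come from the rules in this file.
public sealed class SegmentOptionsSketch
{
    // Default from the rules: Math.Max(Environment.ProcessorCount * 4, 32).
    public static int DefaultMaxParallelImageAnalysis { get; } =
        Math.Max(Environment.ProcessorCount * 4, 32);

    public int MaxParallelImageAnalysis { get; init; } = DefaultMaxParallelImageAnalysis;

    // Assumed validation hook: rejects non-positive values instead of
    // clamping them or falling back to unlimited concurrency.
    public void Validate()
    {
        if (MaxParallelImageAnalysis <= 0)
        {
            throw new ArgumentOutOfRangeException(
                nameof(MaxParallelImageAnalysis),
                MaxParallelImageAnalysis,
                "MaxParallelImageAnalysis must be positive; fix the configuration instead of relying on a fallback.");
        }
    }
}
```

In this sketch a caller-supplied value is used exactly as given; the only two outcomes are "accepted unchanged" or "rejected with an actionable error", which matches the rule against silent overrides.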
- Start work immediately; ask questions only for true blockers not discoverable from code/docs.
- Report status when tasks are complete or when blocked by external dependencies.
- Be direct and factual; no fluff.
- Challenge weak assumptions and call out risks.
- If unsure, state uncertainty and propose verification.
- Treat quality and security regressions as blockers.
- Follow `.editorconfig` and existing repository conventions.
- Prefer explicit constants/config over magic literals.
- Keep public APIs documented and errors actionable.
- Never commit secrets, API keys, or customer data.
- Never mock internal systems in integration tests.
- Never skip tests to force green CI.
- Never force-push to `main`.
- Never approve/merge PRs automatically.
Always:
- Read `AGENTS.md` and relevant docs before edits.
- Run required verification commands before finalizing.
Ask first:
- Public API contract changes
- New third-party dependencies
- Database/schema changes
- Deleting code files
- Clear diagnostics and deterministic converter routing.
- Architecture-first implementation with documented flows.
- Silent fallbacks that hide configuration or provider failures.
- Brittle tests coupled to exact exception text formatting.