Ingestion re-implement on updated Elastic.Ingest.Elasticsearch #2755

Open
Mpdreamz wants to merge 29 commits into main from feature/ingest-rearch

Mpdreamz commented Feb 22, 2026

Summary

Migrate Elasticsearch indexing to source-generated mappings via Elastic.Mapping and the IncrementalSyncOrchestrator from Elastic.Ingest.Elasticsearch, replacing ~2200 lines of hand-rolled ingest/enrichment code. Introduces a clear separation between build type ({type}) and environment ({env}) in all index and resource naming, with consistent resolution across write (indexing) and read (search) paths.

Key changes

  • Source-generated mapping context: DocumentationMappingConfig.cs declares index structure, field mappings, and analysis settings using [Index<T>] attributes. The source generator produces a typed CreateContext(type:, env:) factory, eliminating manual index name construction.
  • IncrementalSyncOrchestrator replaces the two manually managed ElasticsearchLexicalIngestChannel / ElasticsearchSemanticIngestChannel classes. Dual-index writes, alias rotation, and hash-based rollover detection are now handled by the library.
  • AI enrichment via AiEnrichmentOrchestrator — replaces the hand-rolled LLM client implementation (~1600 lines removed). Uses ES|QL COMPLETION for server-side inference with configurable CompletionTimeout (2 min) and CompletionMaxRetries (2). AI enrichment is now the default for all index commands (opt out via --no-ai-enrichment).
  • Centralized endpoint resolution: ElasticsearchEndpointFactory resolves the Elasticsearch URL, credentials, BuildType (from the DOCS_BUILD_TYPE env var, default "isolated"), and Environment (from DOTNET_ENVIRONMENT/ENVIRONMENT, default "dev") in one place. Both write and read paths use endpoints.BuildType and endpoints.Environment consistently.
  • Explicit env: parameter to CreateContext — all CreateContext calls now pass env: endpoints.Environment to avoid the library's ResolveDefaultNamespace() picking up raw DOTNET_ENVIRONMENT=Development without lowercasing.
  • Jina v5 dense embeddings added alongside existing ELSER sparse embeddings on the semantic index variant.

Resource naming convention

All Elasticsearch resources now follow a structured naming convention that includes both build type and environment:

| Resource | Example (type=assembler, env=dev) |
|---|---|
| Lexical backing index | docs-assembler.lexical-dev-2025.10.23.120521 |
| Lexical write alias | docs-assembler.lexical-dev-latest |
| Semantic backing index | docs-assembler.semantic-dev-2025.10.23.120521 |
| Semantic write alias | docs-assembler.semantic-dev-latest |
| Synonym set | docs-assembler-dev |
| Query ruleset | docs-ruleset-assembler-dev |
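
The naming pattern listed above can be sketched as a handful of string templates. This is an illustration only: `ResourceNames` and its methods are hypothetical names rather than part of Elastic.Mapping or this PR, and the timestamp layout is inferred from the example suffix `2025.10.23.120521`.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: it only reproduces the naming pattern from the
// examples above and is not the generated CreateContext(type:, env:) factory.
public static class ResourceNames
{
    // Assumption: layout inferred from the example suffix "2025.10.23.120521".
    private const string BatchFormat = "yyyy.MM.dd.HHmmss";

    // e.g. docs-assembler.lexical-dev-latest
    public static string WriteAlias(string type, string variant, string env) =>
        $"docs-{type}.{variant}-{env}-latest";

    // e.g. docs-assembler.semantic-dev-2025.10.23.120521
    public static string BackingIndex(string type, string variant, string env, DateTimeOffset batch) =>
        $"docs-{type}.{variant}-{env}-{batch.ToString(BatchFormat, CultureInfo.InvariantCulture)}";

    // e.g. docs-assembler-dev
    public static string SynonymSet(string type, string env) => $"docs-{type}-{env}";

    // e.g. docs-ruleset-assembler-dev
    public static string QueryRuleset(string type, string env) => $"docs-ruleset-{type}-{env}";
}
```

Because every name takes both `type` and `env`, two environments of the same build type never share an alias, synonym set, or ruleset.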

DocumentationEndpoints configuration

| Property | Env var | Default | Description |
|---|---|---|---|
| BuildType | DOCS_BUILD_TYPE | "isolated" | Build type: assembler, isolated, codex |
| Environment | DOTNET_ENVIRONMENT / ENVIRONMENT | "dev" | Deployment environment for resource naming |
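
As a rough sketch of the defaults described above, the resolution could look like the following. `EndpointDefaults` is a hypothetical stand-in, not the real ElasticsearchEndpointFactory, which is not shown in this description. The lowercasing and the mapping of `Development` to `dev` are assumptions, based on the stated goal of keeping `Development` out of index names.

```csharp
#nullable enable
using System;

// Hypothetical sketch of the documented defaults; not the actual factory code.
public static class EndpointDefaults
{
    // DOCS_BUILD_TYPE, falling back to "isolated".
    public static string ResolveBuildType(Func<string, string?> getEnv) =>
        NonEmpty(getEnv("DOCS_BUILD_TYPE")) ?? "isolated";

    // DOTNET_ENVIRONMENT, then ENVIRONMENT, falling back to "dev".
    public static string ResolveEnvironment(Func<string, string?> getEnv)
    {
        var raw = NonEmpty(getEnv("DOTNET_ENVIRONMENT")) ?? NonEmpty(getEnv("ENVIRONMENT"));
        if (raw is null)
            return "dev";

        // Assumption: values are lowercased and "development" shortens to "dev".
        var lowered = raw.ToLowerInvariant();
        return lowered == "development" ? "dev" : lowered;
    }

    private static string? NonEmpty(string? value) =>
        string.IsNullOrWhiteSpace(value) ? null : value.Trim();
}
```

Passing the lookup as a delegate (for example `Environment.GetEnvironmentVariable`) keeps the sketch testable without mutating process-wide environment variables.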

Library versions

Elastic.Ingest.Elasticsearch and Elastic.Mapping: 0.17.1 → 0.34.5

Key capabilities from the library upgrades:

  • IncrementalSyncOrchestrator<T> — dual-index writes with coordinated alias rotation and hash-based rollover
  • AiEnrichmentOrchestrator — streaming IAsyncEnumerable<AiEnrichmentProgress> API for AI enrichment lifecycle
  • OnRolloverDecision callback — exposes IndexRolloverInfo (label, local/remote hash, rolled over status) for diagnostics
  • ExportResponseCallback / ExportMaxRetriesCallback — bulk response logging with atomic running totals per channel
  • CompletionTimeout / CompletionMaxRetries on AiEnrichmentOptions — configurable ES|QL completion parameters
  • Source-generated [ElasticsearchMappingContext] — typed mapping builders with CreateContext(type:, env:) factory
  • IndexVariant on [AiEnrichment] — target AI enrichment to specific mapping contexts (semantic only)
  • Fix [Text] + [JsonIgnore(WhenWritingNull)] — source generator now emits type: "text" so dot-path sub-fields merge as multi-fields, not object properties
  • Fix [Object] sub-type attribute traversal: [Keyword(Normalizer)] on nested object types is now correctly emitted in the mapping
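
To illustrate the idea behind hash-based rollover from the list above, here is a minimal sketch that compares a locally computed hash with the one recorded remotely. `RolloverDecision` and `MappingHash` are hypothetical names and are not the library's IndexRolloverInfo. Hashing the mapping JSON with SHA-256 is likewise an assumption about what gets hashed and how.

```csharp
#nullable enable
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical shape loosely mirroring the fields this PR says are exposed
// for diagnostics (label, local/remote hash, rolled-over status).
public sealed record RolloverDecision(string Label, string LocalHash, string? RemoteHash)
{
    // Roll over when no remote hash exists yet or when it differs from the local one.
    public bool RolledOver => !string.Equals(LocalHash, RemoteHash, StringComparison.Ordinal);
}

public static class MappingHash
{
    // Assumption: the hash covers the serialized mapping/settings JSON.
    public static string Compute(string mappingJson) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(mappingJson)));
}
```

With this shape, an unchanged mapping keeps writing to the current backing index, while any change to the mapping text produces a different hash and triggers a new backing index.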

Integration test improvements

  • All integration tests (Search.IntegrationTests, Mcp.Remote.IntegrationTests) log endpoint URL, search index, and ruleset name upfront
  • When a search returns 0 results, a CountAsync call reports whether the index contains any documents at all
  • ElasticsearchEndpointFactory.Create() accepts optional buildType and environment parameters for explicit test configuration
  • SearchRelevanceTests uses ConfigurationFileProvider.CreateSearchConfiguration() instead of manually constructing search config

Deleted code

  • ElasticsearchIngestChannel.cs and ElasticsearchIngestChannel.Mapping.cs (~420 lines)
  • ElasticsearchLlmClient.cs and related AI enrichment hand-rolled implementation (~1600 lines)
  • ElasticsearchLlmClientTests.cs (304 lines)
  • Removed support for DOCUMENTATION_ELASTIC_INDEX environment variable

Net effect: 104 files changed, 2258 insertions, 3590 deletions

Test plan

  • dotnet build passes
  • ./build.sh unit-test passes
  • Index names resolve correctly with explicit env: parameter (no more Development in index names)
  • Search integration tests point to correct docs-assembler.semantic-dev-* index
  • AI enrichment targets semantic index only (not lexical)
  • Mapping workarounds removed after Elastic.Mapping 0.34.5 fixes
  • Full assembler index run completes without document_parsing_exception errors
  • Verify synonym set created as docs-assembler-dev
  • Verify ruleset created as docs-ruleset-assembler-dev
  • Verify search API resolves to correct read alias and ruleset
  • Verify AI enrichment cache naming: docs-assembler.semantic-{env}-latest-ai-cache

…mappings

Replace manual channel orchestration with IncrementalSyncOrchestrator<T> and
source-generated ElasticsearchTypeContext from Elastic.Mapping 0.4.0. Add field
type attributes ([Keyword], [Text], [Object], etc.) directly on DocumentationDocument
to drive the mapping source generator, replacing verbose manual JSON mappings.

- Update Elastic.Ingest.Elasticsearch 0.17.1 → 0.19.0, add Elastic.Mapping 0.4.0
- Add mapping attributes to DocumentationDocument and IndexedProduct
- Create DocumentationMappingConfig.cs with two Entity variants (lexical/semantic)
- Rewrite ElasticsearchMarkdownExporter to use orchestrator for dual-index mode
- Delete ElasticsearchIngestChannel.cs and ElasticsearchIngestChannel.Mapping.cs
- Remove unused ReindexAsync from ElasticsearchOperations
- Update SearchBootstrapFixture to use IngestChannel with semantic type context
Replaces `ElasticsearchOptions` with `DocumentationEndpoints` as the single source of truth for
Elasticsearch configuration across all API apps, MCP server, and integration tests.

- Adds `IndexName` property to `ElasticsearchEndpoint` with a field-backed getter defaulting to
  `{IndexNamePrefix}-dev-latest`.
- Creates `ElasticsearchEndpointFactory` in `ServiceDefaults` to centralize user-secrets and
  environment variable reading, eliminating the duplicated `72f50f33` secrets ID pattern.
- Registers `DocumentationEndpoints` as a singleton in `AddDocumentationServiceDefaults`.
- Updates `ElasticsearchClientAccessor` to accept `DocumentationEndpoints` instead of
  `ElasticsearchOptions`, supporting both API key and basic authentication.
- Updates all gateway consumers (`NavigationSearchGateway`, `FullSearchGateway`,
  `DocumentGateway`, `ElasticsearchAskAiMessageFeedbackGateway`) to use endpoint properties.
- Simplifies all three integration test files (`SearchRelevanceTests`,
  `McpToolsIntegrationTestsBase`, `SearchBootstrapFixture`) to use `ElasticsearchEndpointFactory`
  and `ElasticsearchTransportFactory`, removing manual config construction.
- Deletes `ElasticsearchOptions.cs` and removes `Microsoft.Extensions.Configuration.UserSecrets`
  from the Search project.
Move mapping context (DocumentationMappingContext, LexicalConfig, SemanticConfig,
DocumentationAnalysisFactory) from Elastic.Markdown to Elastic.Documentation so
both indexing and search derive index names from the same source. Add ContentHash
helper to avoid Elastic.Ingest.Elasticsearch dependency in Elastic.Documentation.

Remove IndexName from ElasticsearchEndpoint, add Namespace to DocumentationEndpoints.
ElasticsearchEndpointFactory resolves namespace from DOCUMENTATION_ELASTIC_INDEX env
var (backward compat), DOTNET_ENVIRONMENT, ENVIRONMENT, or falls back to "dev".

ElasticsearchClientAccessor derives SearchIndex and RulesetName from namespace
instead of parsing the old IndexName string. Remove ExtractRulesetName and all
hardcoded "semantic-docs-dev-latest" assignments from tests and config files.
Enable IndexPatternUseBatchDate now that Elastic.Mapping supports it,
and pass batchTimestamp to IngestChannelOptions in the lexical-only path
so the channel uses the exporter's timestamp for index name computation.
…meter

Simplify DocumentationTooling endpoint resolution by delegating to
ElasticsearchEndpointFactory. Add missing skipOpenApi parameter to
IsolatedIndexService.Index call.
The lexical-only code path manually reimplemented drain, delete-stale,
refresh, and alias logic that the orchestrator handles automatically.
Remove the flag end-to-end: CLI parameters, configuration, exporter
branching, and CLI documentation.
@Mpdreamz Mpdreamz self-assigned this Feb 22, 2026
@Mpdreamz Mpdreamz requested review from a team and reakaleek February 22, 2026 17:41
@Mpdreamz Mpdreamz changed the title feature/ingest rearch Ingestion re-implement on updated Elastic.Ingest.Elasticsearch Feb 22, 2026
github-actions bot commented Feb 22, 2026

🔍 Preview links for changed docs

Add .jina-embeddings-v5-text-small inference on 6 fields (title, abstract,
ai_rag_optimized_summary, ai_questions, ai_use_cases, stripped_body) to
enable hybrid sparse+dense retrieval. Rename InferenceId to ElserInferenceId
for clarity.
Use source-generated IStaticMappingResolver delegates for auto-stamping
BatchIndexDate and LastUpdated instead of manual assignment. Replace
DocumentationAnalysisFactory.CreateContext with direct context
customization via WithIndexName() and record-with expressions. Pass
IndexSettings for default_pipeline conditionally at runtime.
…nment

Rename indexNamespace to buildType throughout the exporter pipeline so
callers pass the build type (assembler, isolated, codex) instead of the
environment name. Search services now hardcode "assembler" as the type
since they always target assembler indices.

ResolveNamespace renamed to ResolveEnvironment and updated to parse the
old production index format ({variant}-docs-{env}-{timestamp}) to
extract the environment name.
… to simplify index naming logic. Update Elasticsearch dependencies to version 0.28.0.
@Mpdreamz Mpdreamz marked this pull request as ready for review February 24, 2026 19:30
@Mpdreamz Mpdreamz requested a review from a team as a code owner February 24, 2026 19:30
…entOrchestrator

Upgrade Elastic.Ingest.Elasticsearch and Elastic.Mapping to 0.30.0 which includes
source-generated AI enrichment support (elastic/elastic-ingest-dotnet#151).

- Annotate DocumentationDocument with [AiInput]/[AiField] attributes
- Add [AiEnrichment<DocumentationDocument>] to DocumentationMappingContext
- Replace ElasticsearchEnrichmentCache + ElasticsearchLlmClient + EnrichPolicyManager
  with a single AiEnrichmentOrchestrator that runs post-indexing
- Remove 7 handrolled enrichment files (~1650 lines) and associated tests

Made-with: Cursor
Made-with: Cursor

# Conflicts:
#	src/api/Elastic.Documentation.Mcp.Remote/Program.cs
Mpdreamz added 7 commits March 2, 2026 13:35
- AiEnrichmentOrchestrator now takes (ITransport, ElasticsearchTypeContext)
  instead of (ITransport, IAiEnrichmentProvider)
- EnrichAsync uses streaming IAsyncEnumerable<AiEnrichmentProgress> API
  with per-phase progress logging
- Fix bug: the AI enrichment pipeline is now set only on the semantic (secondary) index
  and is no longer wastefully applied to the lexical (primary) index
- Add OnReindexProgress and OnDeleteByQueryProgress logging callbacks
- IConfigureElasticsearch<T> now requires ConfigureAnalysis and IndexSettings
- AI enrichment enabled by default; CLI flag flipped to --no-ai-enrichment

Made-with: Cursor
…etry config

Configure AI enrichment with 2-minute completion timeout (down from 5m
default) and explicit 2 retries for ES|QL COMPLETION calls that fail
with HTTP 408/429/5xx.

Made-with: Cursor
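
The retry rule in the commit above (retry ES|QL COMPLETION calls that fail with HTTP 408, 429, or 5xx, at most 2 retries) could be expressed as the predicate below. This is only an illustrative sketch: `CompletionRetryPolicy` and its members are hypothetical and are not the AiEnrichmentOptions API.

```csharp
// Hypothetical helper illustrating the retry rule; not library code.
public static class CompletionRetryPolicy
{
    // Matches the CompletionMaxRetries value configured in this PR.
    public const int MaxRetries = 2;

    // 408 (request timeout), 429 (too many requests), and any 5xx count as transient.
    public static bool IsRetryable(int statusCode) =>
        statusCode == 408 || statusCode == 429 || (statusCode >= 500 && statusCode <= 599);

    // retriesSoFar is the number of retries already attempted (0 after the first call).
    public static bool ShouldRetry(int statusCode, int retriesSoFar) =>
        retriesSoFar < MaxRetries && IsRetryable(statusCode);
}
```

With 2 retries, a call is attempted at most three times in total before the failure is surfaced.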
…tions

Log bulk response details (HTTP status, item/error counts, buffer size)
on every response. Emit diagnostics error when max retries are exhausted.

Made-with: Cursor
Replace raw HTTP status/item dump with cumulative indexed count using
per-channel Interlocked counters. Also bump default BufferSize to 100.

Made-with: Cursor
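
A minimal sketch of the per-channel running totals described in the commit above, assuming each bulk response reports an item count and an error count. `ChannelTotals` is a hypothetical type; the actual ExportResponseCallback wiring is not shown here.

```csharp
using System.Threading;

// Hypothetical per-channel counter; one instance would be kept per channel
// (for example one for the lexical index and one for the semantic index).
public sealed class ChannelTotals
{
    private long _indexed;
    private long _failed;

    // Adds one bulk response to the totals and returns the new cumulative values.
    // Interlocked.Add keeps each counter correct when callbacks run concurrently.
    public (long Indexed, long Failed) Record(int itemCount, int errorCount)
    {
        var indexed = Interlocked.Add(ref _indexed, itemCount - errorCount);
        var failed = Interlocked.Add(ref _failed, errorCount);
        return (indexed, failed);
    }
}
```

Logging the returned tuple gives one cumulative line per response rather than a raw dump of every bulk item.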
Replace the ad-hoc `buildType` string parameter with `DocumentationEndpoints.DataSource`
(resolved from `DOCS_BUILD_TYPE` env var, default "isolated") and rename `Namespace` to
`Environment` (resolved from `DOTNET_ENVIRONMENT`/`ENVIRONMENT`, default "dev"). This
ensures both write (indexing) and read (search) paths use a single source of truth for
index naming. Remove legacy `DOCUMENTATION_ELASTIC_INDEX` env var parsing. API logging
now uses `ElasticsearchClientAccessor.SearchIndex` instead of duplicating `CreateContext`.

Made-with: Cursor
Add IndexVariant = "Semantic" to [AiEnrichment] so the provider attaches only
to the semantic context and derives its AI cache name from the semantic write
alias. Switch AiEnrichmentOrchestrator to use the semantic type context. Also
gains binary-split batch reduction on COMPLETION timeouts and "dev" default
namespace fallback from the upstream release.

Made-with: Cursor
…ng workarounds

Bump to 0.34.3 which fixes secondary index rollover (now creates new backing
index when hash changes) and exposes IndexRolloverInfo diagnostics. Wire up
OnRolloverDecision callback for per-index hash logging. Add explicit AddField
declarations for ai_questions, ai_use_cases (base text type), and product/
related_products sub-fields (keyword with normalizer) to work around source
generator gaps with dot-path merge and [Object] sub-type traversal.

Made-with: Cursor
Mpdreamz added 3 commits March 3, 2026 21:54
…st diagnostics

Pass env: explicitly to CreateContext() instead of relying on
ResolveDefaultNamespace() which reads DOTNET_ENVIRONMENT raw (returning
"Development" instead of "dev"). Add environment parameter to
ElasticsearchEndpointFactory.Create() so tests can pin the environment.
Add diagnostic output (endpoint, index, doc count) to all integration
tests for easier debugging when results are empty.

Made-with: Cursor
Replace manually constructed SearchConfiguration with
ConfigurationFileProvider.CreateSearchConfiguration() to keep tests
in sync with the real config/search.yml.

Made-with: Cursor
…ude environment in resource names

Rename DocumentationEndpoints.DataSource to BuildType to match the
DOCS_BUILD_TYPE env var. Update Elastic.Ingest.Elasticsearch and
Elastic.Mapping to 0.34.5 which fixes [Text]+[JsonIgnore(WhenWritingNull)]
and [Object] sub-type attribute traversal, removing workaround AddField
calls for ai_questions, ai_use_cases, product.*, and related_products.*.
Include environment in synonym set and ruleset names for proper isolation
(e.g. docs-assembler-dev, docs-ruleset-assembler-dev).

Made-with: Cursor
/// Build type identifier (assembler, isolated, codex). Controlled by DOCS_BUILD_TYPE env var.
/// </summary>
-public string DataSource { get; set; } = "isolated";
+public string BuildType { get; set; } = "isolated";

We have an existing BuildType enum. Wondering if we should reuse it.

The synonymSetName for analysis config was updated but the setName
in PublishSynonymsAsync was still using the old format without
environment.

Made-with: Cursor