Skip to content

feat: add docs-ingestion and docs-embeddings pipelines#422

Merged
pulkit004 merged 20 commits intomainfrom
docs-ingestion
Mar 25, 2026
Merged

feat: add docs-ingestion and docs-embeddings pipelines#422
pulkit004 merged 20 commits intomainfrom
docs-ingestion

Conversation

@pulkit004
Copy link
Copy Markdown
Contributor

Body:

Summary

  • Add docs-ingestion pipeline: scans MDX/MD content and API specs, normalizes, chunks, and outputs structured JSON for RAG
  • Add docs-embeddings pipeline: reads chunks, generates embeddings via AWS Bedrock (Titan v2), upserts to Pinecone, uploads content to S3
  • Add CI workflow for docs-ingestion (build, test, normalization determinism check, token limits, smoke test, embedding dry-run)
  • Fix: use plain Markdown parser for normalized files (residual HTML breaks MDX parser)
  • Fix: defer VectorDB construction for dry-run mode
  • Update .gitignore for tooling artifacts

Test plan

  • cd docs-ingestion && npm ci && npm test — all tests pass
  • npm run smoke-test-ingestion — end-to-end smoke test passes
  • npm run normalize-api-specs — deterministic output
  • cd docs-embeddings && npm ci && npm run dry-run — validates without external API calls
  • CI pipeline passes on this PR

pulkitsetu and others added 13 commits January 7, 2026 16:30
- Added main ingestion pipeline in `index.ts` to orchestrate the processing of markdown files.
- Introduced modules for scanning, parsing, chunking, and metadata extraction.
- Implemented incremental updates to skip unchanged files and merge with previous ingestion results.
- Created utility functions for file scanning, text cleaning, and chunk validation.
- Established a structured output format for processed documentation chunks.
- Added TypeScript configuration for improved type safety and module resolution.
- Introduced a new script `normalize-api-specs.ts` for converting OpenAPI 3.x / Swagger 2.0 JSON specs into deterministic Markdown files, ensuring token limits for RAG compatibility.
- Implemented functions for JSON serialization, safe parsing, schema rendering, and endpoint parsing.
- Added a new method `parseMarkdownFileAsPlainMd` in `parser.ts` to handle plain Markdown files, specifically for API spec normalized files containing JSON code blocks.
…line

- Add sentence-level splitting fallback in splitOversizedChunk() for
  chunks exceeding the 1500-token embedding limit. Cascading delimiters:
  sentence punctuation → colon/semicolon → comma → pipe (tables) →
  word-level hard-split as last resort. Reduces oversized chunks from
  111 to 1 (unavoidable code-dominated block).

- Add extractProduct() and extractCategory() to metadata.ts, populating
  every chunk with product and category fields for Pinecone filtering.

- Fix MDX normalizer to emit paragraph breaks (\n\n) after </tr>,
  </table>, </div>, </blockquote>, </details>, </p> and around
  unwrapped JSX wrapper components (Card, Row, Portion, Text, Badge).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ust test expectation for discovered spec files
…nore

Normalized MDX files have JSX stripped, so residual HTML fragments break
the MDX parser. Switch to parseMarkdownFileAsPlainMd for ingestion.
Also ignore .ruflo/, .claude-flow/, and .mcp.json in .gitignore.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Checklist to merge a PR 🚀

To merge this pull request, please take time to complete the checklist.

What action did you perform?

Review the corresponding checklist items for the action you performed and mark them done.

Edit an existing content (MDX) page

Checklist

  • Review changes using the MDX preview option
  • If the length of content >15000 chars, use the Content preview portal to view changes
  • If a redirect is needed to the existing page, add a key, value pair in redirects.json

Edit an existing API reference page

Checklist


Add a new content (MDX) page

Checklist

  • Create a .mdx file with the path as its name in the content folder
  • Add frontmatter with all the metadata
  • Review the order of items in Sidebar using the Sidebar preview option
  • Review changes using the MDX preview option
  • If the length of content >15000 chars, use the Content preview portal to view changes
  • Created a folder with the same name, if any children were to be added to the page
  • Once all changes are done, update the menu items by using the Menu Items option
  • Add a key, and value pair in redirects.json if you wish to have a redirect to the new page

Add a new API reference page

Checklist

  • Create a .json file with the product path as its name
  • Create an api-reference.mdx file in the respective product folder inside content folder
  • Add frontmatter with all the metadata
  • Review the order of items in Sidebar using the Sidebar preview option
  • Add API reference in JSON format (OpenAPI or Swagger) into created .json file.
  • Used the Content preview portal to view changes
  • Once all changes are done, update the menu items by using the Menu Items option

pulkitsetu and others added 4 commits March 23, 2026 11:13
- Replace package-lock.json with yarn.lock in both docs-ingestion and
  docs-embeddings
- Configure nodeLinker: node-modules in .yarnrc.yml to avoid PnP issues
- Update CI workflow: npm ci → yarn install --immutable, npm run → yarn
- Update package.json scripts to use yarn internally
- Add staging to CI trigger branches
- Update .gitignore files accordingly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The normalize-mdx.test.ts integration tests require .docs-normalized/
to exist. Add the normalization step before running tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Critical fixes:
- Fix region typo ap-south-11 → ap-south-1 in embed-all.ts, verify-embed.ts
- Fix worker pool error handling: catch failures, filter undefined results,
  throw if >2% fail
- Fix state file corruption: ENOENT-only silent catch, backup corrupt files,
  atomic writes via temp+rename
- Fix batch upsert: 3 retries with exponential backoff, circuit breaker
  after 3 consecutive failures
- Fix state hash tracking: only save successfully embedded hashes
- Fix batchDelete: add retry logic with backoff
- Fix API spec accumulation: always strip stale api-reference chunks

High priority:
- Git info fallback: return defaults instead of throwing in shallow clones
- File I/O retry: retryAsync utility with exponential backoff for transient
  FS errors (EMFILE, EAGAIN, ENFILE, EBUSY)
- Reduce Bedrock concurrency default from 5 to 3 (ap-south-1 rate limits)
- Pinecone 10k vector limit warning in listAllIds()

Token optimization:
- Chunk target 600→700, overlap 125→80 (60-100 range)
- MAX_EMBEDDABLE_TOKENS safety margin: 1400 (from 1500)
- Near-minimum chunk warning (80-100 tokens)

Security:
- Prompt injection detection regex (11 patterns) in text-cleaner.ts
- Injection warnings during chunk processing in index.ts
- yarn audit step in CI (non-blocking)
- Migrate default AWS region to ap-south-1

Observability:
- Memory usage logging in sync.ts
- Deletion audit logging with per-hash detail

Tests & docs:
- 14 new tests: detectPromptInjection (10), retryAsync (4)
- Move retryAsync to shared utils.ts
- Update CLAUDE.md with new constants

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI runners have Yarn 1.22 globally. Add corepack enable step so CI
uses Yarn 4.12.0 matching local development. Add packageManager field
to both package.json files for consistency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@pulkit004 pulkit004 changed the title Title: feat: docs ingestion and embedding pipelines feat: add docs-ingestion and docs-embeddings pipelines Mar 25, 2026
Critical:
- Fix paragraph position lookup in chunker.ts to use running offset
  instead of indexOf (which always found first occurrence for duplicates)
- Move state save AFTER Pinecone upsert in sync.ts so a crash between
  them no longer silently loses chunks from the vector DB

Important:
- Fix deduplication no-op: previously-processed files now correctly
  skipped when commit SHA unchanged (deduplication.ts)
- Replace npm run with yarn in embed-all.ts after Yarn 4 migration
- Align embedding token threshold from 1500 to 1400 in sync.ts to
  match embedding-helpers.ts
- Add 3-attempt retry with exponential backoff to S3 uploads
  (content-uploader.ts)
- Recompute token_count after text cleaning in metadata.ts so
  downstream embedding filters use accurate values

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment on lines +17 to +26
const MAX_EMBEDDABLE_TOKENS = 1400; // Above this: dominates vector search (includes safety margin)
const OVERSIZED_THRESHOLD = 1200; // Flag for special handling

/**
* Determines if a chunk should be embedded
*
* Rules:
* - Skip if < 80 tokens (micro-chunks with no semantic value)
* - Skip if > 1500 tokens (oversized chunks that dominate retrieval)
* - Embed everything else
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mismatch in the MAX_EMEDDABLE_TOKENS = 1400, make it 1500 since the embedding pipeline filters at 1500, in sync.ts within docs-embedding. Also disputes with the comment

try {
// Get content directory from command line or use default
// Default: ../content (assumes running from docs-ingestion/)
const contentDir = process.argv[2] || path.join(process.cwd(), '..', 'content');
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Within CI we run normalize-mdx to generate docs-normalized folder. Which then should be used to generate the chunks. As we do in the smoke-test-ingestion.

However when we call the node/index.js for docs-ingestion job, no argv[2] is passed for the normalised mdx and hence the entire chunks are made out of ./content folder. Are we intentionally making this distinction?

Ideally we should have the normalized MDX as the source of truth for the knowledge base


- name: Run ingestion pipeline
working-directory: docs-ingestion
run: node dist/index.js
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no argument for the file path for ./docs-normalized. intentional?

pulkitsetu and others added 2 commits March 25, 2026 17:02
- Warn when safeParseJSON falls back to lenient parsing (normalize-api-specs.ts)
- Guard pipe-split fallback against markdown tables in splitBySentence (chunker.ts)
- Extract MERGE_MAX_CHUNK_SIZE constant from magic number 800 (chunker.ts)
- Use robust import.meta.url for main-module detection (normalize-mdx.ts)
- Add empty API key guard in VectorDB constructor (vector-db.ts)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lower MIN_EMBEDDABLE_TOKENS from 80 to 50 and raise MAX from 1400 to
1600 to recover ~430 previously unembeddable chunks (87.7% → 95.4%).

- Add FORCE_EMBED_PATTERNS with isForceEmbeddable() function to bypass
  max threshold for critical content (UMAP mandate responses)
- Enable cross-section micro-chunk merging in mergeMicroChunks with
  tighter size limit (700 vs 800) for cross-section joins
- Add size guard to forward-merge in mergeMicroChunks
- Update CLAUDE.md constants table

Dry run validated: 5576 chunks, 4124 embeddable (after dedup), 0 too
large, 258 too small (<50 tok).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,558 @@
/**
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Responding to @pankaj0308's review comments:


Re: MAX_EMBEDDABLE_TOKENS mismatch (embedding-helpers.ts)

Good catch — this has been fixed. Both embedding-helpers.ts and sync.ts now use aligned thresholds:

  • MIN_EMBEDDABLE_TOKENS = 50 (lowered from 80 to recover ~300 borderline chunks)
  • MAX_EMBEDDABLE_TOKENS = 1600 (raised from 1400 — Titan v2 supports 8192)

Additionally, a FORCE_EMBED_PATTERNS list with isForceEmbeddable() function was added so critical content (like UMAP mandate responses at 1665 tokens) bypasses the max threshold entirely. See commit f9bb103.


Re: Ingestion uses ./content instead of .docs-normalized/ + CI missing path argument

This is by design. The ingestion pipeline uses parseMarkdownFileAsPlainMd (line 153 in index.ts) which handles raw MDX files directly — it strips JSX/HTML at parse time, producing the same result as feeding in pre-normalized files.

The normalize-mdx step exists primarily for CI determinism validation (run twice, diff output to ensure deterministic normalization). It's not a required prerequisite for ingestion.

Two equivalent paths exist:

  • Path A: normalize-mdx → feed .docs-normalized/ → ingest
  • Path B: Feed raw ./content → ingest (normalizes on-the-fly via plain MD parser)

Both produce identical chunks. Path B is simpler for standalone runs. Path A is used in CI for the determinism check.

If we want to consolidate to a single path in the future, we can make normalize-mdx the mandatory first step and update the default contentDir — but there's no correctness issue with the current approach.

@pulkit004 pulkit004 merged commit b114247 into main Mar 25, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants