feat: add docs-ingestion and docs-embeddings pipelines by pulkit004 · Pull Request #422 · SetuHQ/docs

pulkit004 · 2026-03-18T06:29:43Z

Body:

Summary

Add docs-ingestion pipeline: scans MDX/MD content and API specs, normalizes, chunks, and outputs structured JSON for RAG
Add docs-embeddings pipeline: reads chunks, generates embeddings via AWS Bedrock (Titan v2), upserts to Pinecone, uploads content to S3
Add CI workflow for docs-ingestion (build, test, normalization determinism check, token limits, smoke test, embedding dry-run)
Fix: use plain Markdown parser for normalized files (residual HTML breaks MDX parser)
Fix: defer VectorDB construction for dry-run mode
Update .gitignore for tooling artifacts

Test plan

cd docs-ingestion && npm ci && npm test — all tests pass
npm run smoke-test-ingestion — end-to-end smoke test passes
npm run normalize-api-specs — deterministic output
cd docs-embeddings && npm ci && npm run dry-run — validates without external API calls
CI pipeline passes on this PR

- Added main ingestion pipeline in `index.ts` to orchestrate the processing of markdown files. - Introduced modules for scanning, parsing, chunking, and metadata extraction. - Implemented incremental updates to skip unchanged files and merge with previous ingestion results. - Created utility functions for file scanning, text cleaning, and chunk validation. - Established a structured output format for processed documentation chunks. - Added TypeScript configuration for improved type safety and module resolution.

…support

…ating HTML/JSX tags to single lines.

…json and gitignore

- Introduced a new script `normalize-api-specs.ts` for converting OpenAPI 3.x / Swagger 2.0 JSON specs into deterministic Markdown files, ensuring token limits for RAG compatibility. - Implemented functions for JSON serialization, safe parsing, schema rendering, and endpoint parsing. - Added a new method `parseMarkdownFileAsPlainMd` in `parser.ts` to handle plain Markdown files, specifically for API spec normalized files containing JSON code blocks.

…BED requirement for existing vectors

…line - Add sentence-level splitting fallback in splitOversizedChunk() for chunks exceeding the 1500-token embedding limit. Cascading delimiters: sentence punctuation → colon/semicolon → comma → pipe (tables) → word-level hard-split as last resort. Reduces oversized chunks from 111 to 1 (unavoidable code-dominated block). - Add extractProduct() and extractCategory() to metadata.ts, populating every chunk with product and category fields for Pinecone filtering. - Fix MDX normalizer to emit paragraph breaks (\n\n) after </tr>, </table>, </div>, </blockquote>, </details>, </p> and around unwrapped JSX wrapper components (Card, Row, Portion, Text, Badge). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ust test expectation for discovered spec files

…ethods

…nore Normalized MDX files have JSX stripped, so residual HTML fragments break the MDX parser. Switch to parseMarkdownFileAsPlainMd for ingestion. Also ignore .ruflo/, .claude-flow/, and .mcp.json in .gitignore. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-03-18T06:29:54Z

- Replace package-lock.json with yarn.lock in both docs-ingestion and docs-embeddings - Configure nodeLinker: node-modules in .yarnrc.yml to avoid PnP issues - Update CI workflow: npm ci → yarn install --immutable, npm run → yarn - Update package.json scripts to use yarn internally - Add staging to CI trigger branches - Update .gitignore files accordingly Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The normalize-mdx.test.ts integration tests require .docs-normalized/ to exist. Add the normalization step before running tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Critical fixes: - Fix region typo ap-south-11 → ap-south-1 in embed-all.ts, verify-embed.ts - Fix worker pool error handling: catch failures, filter undefined results, throw if >2% fail - Fix state file corruption: ENOENT-only silent catch, backup corrupt files, atomic writes via temp+rename - Fix batch upsert: 3 retries with exponential backoff, circuit breaker after 3 consecutive failures - Fix state hash tracking: only save successfully embedded hashes - Fix batchDelete: add retry logic with backoff - Fix API spec accumulation: always strip stale api-reference chunks High priority: - Git info fallback: return defaults instead of throwing in shallow clones - File I/O retry: retryAsync utility with exponential backoff for transient FS errors (EMFILE, EAGAIN, ENFILE, EBUSY) - Reduce Bedrock concurrency default from 5 to 3 (ap-south-1 rate limits) - Pinecone 10k vector limit warning in listAllIds() Token optimization: - Chunk target 600→700, overlap 125→80 (60-100 range) - MAX_EMBEDDABLE_TOKENS safety margin: 1400 (from 1500) - Near-minimum chunk warning (80-100 tokens) Security: - Prompt injection detection regex (11 patterns) in text-cleaner.ts - Injection warnings during chunk processing in index.ts - yarn audit step in CI (non-blocking) - Migrate default AWS region to ap-south-1 Observability: - Memory usage logging in sync.ts - Deletion audit logging with per-hash detail Tests & docs: - 14 new tests: detectPromptInjection (10), retryAsync (4) - Move retryAsync to shared utils.ts - Update CLAUDE.md with new constants Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CI runners have Yarn 1.22 globally. Add corepack enable step so CI uses Yarn 4.12.0 matching local development. Add packageManager field to both package.json files for consistency. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Critical: - Fix paragraph position lookup in chunker.ts to use running offset instead of indexOf (which always found first occurrence for duplicates) - Move state save AFTER Pinecone upsert in sync.ts so a crash between them no longer silently loses chunks from the vector DB Important: - Fix deduplication no-op: previously-processed files now correctly skipped when commit SHA unchanged (deduplication.ts) - Replace npm run with yarn in embed-all.ts after Yarn 4 migration - Align embedding token threshold from 1500 to 1400 in sync.ts to match embedding-helpers.ts - Add 3-attempt retry with exponential backoff to S3 uploads (content-uploader.ts) - Recompute token_count after text cleaning in metadata.ts so downstream embedding filters use accurate values Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

pankaj0308 · 2026-03-25T10:58:12Z

docs-ingestion/src/embedding-helpers.ts

+const MAX_EMBEDDABLE_TOKENS = 1400;  // Above this: dominates vector search (includes safety margin)
+const OVERSIZED_THRESHOLD = 1200;    // Flag for special handling
+
+/**
+ * Determines if a chunk should be embedded
+ *
+ * Rules:
+ * - Skip if < 80 tokens (micro-chunks with no semantic value)
+ * - Skip if > 1500 tokens (oversized chunks that dominate retrieval)
+ * - Embed everything else


Mismatch in the MAX_EMEDDABLE_TOKENS = 1400, make it 1500 since the embedding pipeline filters at 1500, in sync.ts within docs-embedding. Also disputes with the comment

pankaj0308 · 2026-03-25T11:10:12Z

docs-ingestion/src/index.ts

+  try {
+    // Get content directory from command line or use default
+    // Default: ../content (assumes running from docs-ingestion/)
+    const contentDir = process.argv[2] || path.join(process.cwd(), '..', 'content');


Within CI we run normalize-mdx to generate docs-normalized folder. Which then should be used to generate the chunks. As we do in the smoke-test-ingestion.

However when we call the node/index.js for docs-ingestion job, no argv[2] is passed for the normalised mdx and hence the entire chunks are made out of ./content folder. Are we intentionally making this distinction?

Ideally we should have the normalized MDX as the source of truth for the knowledge base

pankaj0308 · 2026-03-25T11:11:03Z

.github/workflows/docs-ingestion-ci.yml

+
+      - name: Run ingestion pipeline
+        working-directory: docs-ingestion
+        run: node dist/index.js


no argument for the file path for ./docs-normalized. intentional?

- Warn when safeParseJSON falls back to lenient parsing (normalize-api-specs.ts) - Guard pipe-split fallback against markdown tables in splitBySentence (chunker.ts) - Extract MERGE_MAX_CHUNK_SIZE constant from magic number 800 (chunker.ts) - Use robust import.meta.url for main-module detection (normalize-mdx.ts) - Add empty API key guard in VectorDB constructor (vector-db.ts) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Lower MIN_EMBEDDABLE_TOKENS from 80 to 50 and raise MAX from 1400 to 1600 to recover ~430 previously unembeddable chunks (87.7% → 95.4%). - Add FORCE_EMBED_PATTERNS with isForceEmbeddable() function to bypass max threshold for critical content (UMAP mandate responses) - Enable cross-section micro-chunk merging in mergeMicroChunks with tighter size limit (700 vs 800) for cross-section joins - Add size guard to forward-merge in mergeMicroChunks - Update CLAUDE.md constants table Dry run validated: 5576 chunks, 4124 embeddable (after dedup), 0 too large, 258 too small (<50 tok). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

pulkit004 · 2026-03-25T11:38:29Z

docs-ingestion/src/index.ts

@@ -0,0 +1,558 @@
+/**


Responding to @pankaj0308's review comments:

Re: MAX_EMBEDDABLE_TOKENS mismatch (embedding-helpers.ts)

Good catch — this has been fixed. Both embedding-helpers.ts and sync.ts now use aligned thresholds:

MIN_EMBEDDABLE_TOKENS = 50 (lowered from 80 to recover ~300 borderline chunks)

MAX_EMBEDDABLE_TOKENS = 1600 (raised from 1400 — Titan v2 supports 8192)

Additionally, a FORCE_EMBED_PATTERNS list with isForceEmbeddable() function was added so critical content (like UMAP mandate responses at 1665 tokens) bypasses the max threshold entirely. See commit f9bb103.

Re: Ingestion uses ./content instead of .docs-normalized/ + CI missing path argument

This is by design. The ingestion pipeline uses parseMarkdownFileAsPlainMd (line 153 in index.ts) which handles raw MDX files directly — it strips JSX/HTML at parse time, producing the same result as feeding in pre-normalized files.

The normalize-mdx step exists primarily for CI determinism validation (run twice, diff output to ensure deterministic normalization). It's not a required prerequisite for ingestion.

Two equivalent paths exist:

Path A: normalize-mdx → feed .docs-normalized/ → ingest

Path B: Feed raw ./content → ingest (normalizes on-the-fly via plain MD parser)

Both produce identical chunks. Path B is simpler for standalone runs. Path A is used in CI for the determinism check.

If we want to consolidate to a single path in the future, we can make normalize-mdx the mandatory first step and update the default contentDir — but there's no correctness issue with the current approach.

pulkitsetu and others added 13 commits January 7, 2026 16:30

feat: Add documentation embedding pipeline with AWS Bedrock integration

3da1bff

feat: Implement S3 content upload pipeline with environment variable …

1396640

…support

refactor: improve formatting and readability of MDX files by consolid…

1f61910

…ating HTML/JSX tags to single lines.

feat: Add MDX normalization script and related tests, update package.…

cfd9dce

…json and gitignore

refactor: update embedding pipeline documentation and remove FORCE_EM…

c4bff26

…BED requirement for existing vectors

update with docs

5d61210

fix: update .gitignore to remove package-lock.json and yarn.lock; adj…

be2bd14

…ust test expectation for discovered spec files

fix: restrict pull request branches to main for Docs Ingestion CI

1308c97

fix: defer VectorDB construction for dry-run mode and update access m…

254f04e

…ethods

pulkitsetu and others added 4 commits March 23, 2026 11:13

fix: run normalize-mdx before tests in CI

bce9b1b

The normalize-mdx.test.ts integration tests require .docs-normalized/ to exist. Add the normalization step before running tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

pulkit004 changed the title ~~Title: feat: docs ingestion and embedding pipelines~~ feat: add docs-ingestion and docs-embeddings pipelines Mar 25, 2026

pankaj0308 reviewed Mar 25, 2026

View reviewed changes

pulkitsetu and others added 2 commits March 25, 2026 17:02

pulkit004 commented Mar 25, 2026

View reviewed changes

pankaj0308 approved these changes Mar 25, 2026

View reviewed changes

pulkit004 merged commit b114247 into main Mar 25, 2026
4 checks passed

pulkit004 mentioned this pull request Mar 25, 2026

fix: review fixes, knowledge gap reduction, and env var hardening #423

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add docs-ingestion and docs-embeddings pipelines#422

feat: add docs-ingestion and docs-embeddings pipelines#422
pulkit004 merged 20 commits intomainfrom
docs-ingestion

pulkit004 commented Mar 18, 2026

Uh oh!

github-actions bot commented Mar 18, 2026

Uh oh!

pankaj0308 Mar 25, 2026

Uh oh!

pankaj0308 Mar 25, 2026

Uh oh!

pankaj0308 Mar 25, 2026

Uh oh!

pulkit004 Mar 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

pulkit004 commented Mar 18, 2026

Summary

Test plan

Uh oh!

github-actions bot commented Mar 18, 2026

Checklist to merge a PR 🚀

What action did you perform?

Edit an existing content (MDX) page

Edit an existing API reference page

Add a new content (MDX) page

Add a new API reference page

Uh oh!

pankaj0308 Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

pankaj0308 Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

pankaj0308 Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

pulkit004 Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants