feat: add docs-ingestion and docs-embeddings pipelines #422
Merged
20 commits
e1ef012 feat: Implement documentation ingestion pipeline (pulkitsetu)
3da1bff feat: Add documentation embedding pipeline with AWS Bedrock integration (pulkitsetu)
1396640 feat: Implement S3 content upload pipeline with environment variable … (pulkitsetu)
1f61910 refactor: improve formatting and readability of MDX files by consolid… (pulkitsetu)
cfd9dce feat: Add MDX normalization script and related tests, update package.… (pulkitsetu)
9e68e2d feat: add API spec normalization script and plain Markdown parser (pulkitsetu)
c4bff26 refactor: update embedding pipeline documentation and remove FORCE_EM… (pulkitsetu)
703d7cd feat: fix oversized chunks and add product metadata to ingestion pipe… (pulkitsetu)
5d61210 update with docs (pulkitsetu)
be2bd14 fix: update .gitignore to remove package-lock.json and yarn.lock; adj… (pulkitsetu)
1308c97 fix: restrict pull request branches to main for Docs Ingestion CI (pulkitsetu)
254f04e fix: defer VectorDB construction for dry-run mode and update access m… (pulkitsetu)
ca99dc6 fix: use plain Markdown parser for normalized files and update .gitig… (pulkitsetu)
8877aa5 chore: migrate from npm to yarn and trigger CI on staging PRs (pulkitsetu)
bce9b1b fix: run normalize-mdx before tests in CI (pulkitsetu)
0442bbf feat: production readiness fixes for ingestion and embedding pipelines (pulkitsetu)
edd885b fix: enable Corepack for Yarn 4 in CI (pulkitsetu)
22eaf1a fix: address code review findings for ingestion and embedding pipelines (pulkitsetu)
fb6a0d8 fix: apply review suggestions for ingestion pipeline (pulkitsetu)
f9bb103 feat: reduce knowledge gaps with dynamic embedding thresholds (pulkitsetu)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,327 @@ | ||
| name: Docs Ingestion CI | ||
| | ||
| on: | ||
|   pull_request: | ||
|     branches: [main, staging] | ||
|     paths: | ||
|       - 'docs-ingestion/**' | ||
|       - 'docs-embeddings/**' | ||
|       - 'api-references/**' | ||
|       - 'content/**' | ||
|   push: | ||
|     branches: [main, staging] | ||
|     paths: | ||
|       - 'docs-ingestion/**' | ||
|       - 'docs-embeddings/**' | ||
|       - 'api-references/**' | ||
|       - 'content/**' | ||
| | ||
| jobs: | ||
|   build-and-test: | ||
|     name: Build & Test | ||
|     runs-on: ubuntu-latest | ||
|     defaults: | ||
|       run: | ||
|         working-directory: docs-ingestion | ||
| | ||
|     steps: | ||
|       - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4 | ||
| | ||
|       - name: Enable Corepack | ||
|         run: corepack enable | ||
| | ||
|       - uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # v4 | ||
|         with: | ||
|           node-version: '20' | ||
|           cache: 'yarn' | ||
|           cache-dependency-path: docs-ingestion/yarn.lock | ||
| | ||
|       - name: Install dependencies | ||
|         run: yarn install --immutable | ||
| | ||
|       - name: Security audit | ||
|         run: yarn npm audit --severity moderate | ||
|         continue-on-error: true | ||
| | ||
|       - name: Build | ||
|         run: yarn build | ||
| | ||
|       - name: Normalize MDX (required by integration tests) | ||
|         run: yarn normalize-mdx | ||
| | ||
|       # ── E. Test suite ── | ||
|       - name: Run tests | ||
|         run: yarn test | ||
| | ||
|   normalize-api-specs: | ||
|     name: API Spec Normalization | ||
|     runs-on: ubuntu-latest | ||
|     defaults: | ||
|       run: | ||
|         working-directory: docs-ingestion | ||
| | ||
|     steps: | ||
|       - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4 | ||
| | ||
|       - name: Enable Corepack | ||
|         run: corepack enable | ||
| | ||
|       - uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # v4 | ||
|         with: | ||
|           node-version: '20' | ||
|           cache: 'yarn' | ||
|           cache-dependency-path: docs-ingestion/yarn.lock | ||
| | ||
|       - name: Install dependencies | ||
|         run: yarn install --immutable | ||
| | ||
|       # ── A. Normalization run ── | ||
|       - name: Run API spec normalization | ||
|         run: yarn normalize-api-specs | ||
| | ||
|       - name: Verify output directory exists | ||
|         run: | | ||
|           if [ ! -d "../.api-reference-normalized" ]; then | ||
|             echo "FAIL: .api-reference-normalized/ directory does not exist" | ||
|             exit 1 | ||
|           fi | ||
|           echo "PASS: Directory exists" | ||
| | ||
|       - name: Verify file count | ||
|         run: | | ||
|           count=$(find ../.api-reference-normalized -name '*.md' | wc -l | tr -d ' ') | ||
|           echo "Found $count normalized files" | ||
|           if [ "$count" -lt 200 ]; then | ||
|             echo "FAIL: Expected at least 200 files, got $count" | ||
|             exit 1 | ||
|           fi | ||
|           echo "PASS: File count ($count) >= 200" | ||
| | ||
|       # ── B. Determinism check ── | ||
|       - name: Copy first run output | ||
|         run: cp -r ../.api-reference-normalized /tmp/api-ref-norm-run1 | ||
| | ||
|       - name: Run normalization again | ||
|         run: yarn normalize-api-specs | ||
| | ||
|       - name: Verify determinism | ||
|         run: | | ||
|           diff_output=$(diff -r ../.api-reference-normalized /tmp/api-ref-norm-run1 2>&1) || true | ||
|           if [ -n "$diff_output" ]; then | ||
|             echo "FAIL: Normalization is not deterministic:" | ||
|             echo "$diff_output" | head -20 | ||
|             exit 1 | ||
|           fi | ||
|           echo "PASS: Output is deterministic" | ||
| | ||
|       # ── C. Token limit compliance ── | ||
|       - name: Check token limits | ||
|         run: yarn check-token-limits | ||
| | ||
|       # ── F. Git ignored state check ── | ||
|       - name: Verify .api-reference-normalized is gitignored | ||
|         run: | | ||
|           cd .. | ||
|           if git ls-files --error-unmatch .api-reference-normalized/ 2>/dev/null; then | ||
|             echo "FAIL: .api-reference-normalized/ is tracked by git" | ||
|             exit 1 | ||
|           fi | ||
|           echo "PASS: .api-reference-normalized/ is not tracked" | ||
| | ||
|       - name: Verify .docs-normalized is gitignored | ||
|         run: | | ||
|           cd .. | ||
|           if git ls-files --error-unmatch .docs-normalized/ 2>/dev/null; then | ||
|             echo "FAIL: .docs-normalized/ is tracked by git" | ||
|             exit 1 | ||
|           fi | ||
|           echo "PASS: .docs-normalized/ is not tracked" | ||
| | ||
|   ingestion-smoke-test: | ||
|     name: Ingestion Smoke Test | ||
|     runs-on: ubuntu-latest | ||
|     needs: [build-and-test, normalize-api-specs] | ||
|     defaults: | ||
|       run: | ||
|         working-directory: docs-ingestion | ||
| | ||
|     steps: | ||
|       - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4 | ||
| | ||
|       - name: Enable Corepack | ||
|         run: corepack enable | ||
| | ||
|       - uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # v4 | ||
|         with: | ||
|           node-version: '20' | ||
|           cache: 'yarn' | ||
|           cache-dependency-path: docs-ingestion/yarn.lock | ||
| | ||
|       - name: Install dependencies | ||
|         run: yarn install --immutable | ||
| | ||
|       - name: Build | ||
|         run: yarn build | ||
| | ||
|       # Generate the normalized API specs (required for smoke test) | ||
|       - name: Normalize API specs | ||
|         run: yarn normalize-api-specs | ||
| | ||
|       # Normalize MDX if content/ exists | ||
|       - name: Normalize MDX (if content exists) | ||
|         run: | | ||
|           if [ -d "../content" ]; then | ||
|             yarn normalize-mdx || exit 1 | ||
|           else | ||
|             echo "No content/ directory — skipping MDX normalization" | ||
|           fi | ||
| | ||
|       # ── D. Ingestion smoke test ── | ||
|       - name: Run ingestion smoke test | ||
|         run: yarn smoke-test-ingestion | ||
| | ||
|   embedding-dry-run: | ||
|     name: Embedding Dry Run | ||
|     runs-on: ubuntu-latest | ||
|     needs: [ingestion-smoke-test] | ||
| | ||
|     steps: | ||
|       - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4 | ||
| | ||
|       - name: Enable Corepack | ||
|         run: corepack enable | ||
| | ||
|       - uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # v4 | ||
|         with: | ||
|           node-version: '20' | ||
|           cache: 'yarn' | ||
|           cache-dependency-path: | | ||
|             docs-ingestion/yarn.lock | ||
|             docs-embeddings/yarn.lock | ||
| | ||
|       # Build ingestion pipeline and produce chunks.json | ||
|       - name: Install ingestion dependencies | ||
|         working-directory: docs-ingestion | ||
|         run: yarn install --immutable | ||
| | ||
|       - name: Build ingestion | ||
|         working-directory: docs-ingestion | ||
|         run: yarn build | ||
| | ||
|       - name: Normalize API specs | ||
|         working-directory: docs-ingestion | ||
|         run: yarn normalize-api-specs | ||
| | ||
|       - name: Normalize MDX (if content exists) | ||
|         working-directory: docs-ingestion | ||
|         run: | | ||
|           if [ -d "../content" ]; then | ||
|             yarn normalize-mdx || exit 1 | ||
|           else | ||
|             echo "No content/ directory — skipping MDX normalization" | ||
|           fi | ||
| | ||
|       - name: Run ingestion pipeline | ||
|         working-directory: docs-ingestion | ||
|         run: node dist/index.js | ||
| | ||
|       - name: Verify chunks.json exists | ||
|         run: | | ||
|           if [ ! -f "docs-ingestion/output/chunks.json" ]; then | ||
|             echo "FAIL: chunks.json not produced" | ||
|             exit 1 | ||
|           fi | ||
|           echo "PASS: chunks.json exists" | ||
| | ||
|       # Build embeddings pipeline and run dry-run | ||
|       - name: Install embedding dependencies | ||
|         working-directory: docs-embeddings | ||
|         run: yarn install --immutable | ||
| | ||
|       - name: Build embeddings | ||
|         working-directory: docs-embeddings | ||
|         run: yarn build | ||
| | ||
|       # ── E. Embedding dry run ── | ||
|       - name: Run embedding dry run | ||
|         working-directory: docs-embeddings | ||
|         env: | ||
|           DRY_RUN: 'true' | ||
|           INGESTION_OUTPUT_PATH: ${{ github.workspace }}/docs-ingestion/output/chunks.json | ||
|         run: node dist/index.js --dry-run | ||
| | ||
|   # ── Deploy: update knowledge base (main only) ── | ||
|   deploy-knowledge-base: | ||
|     name: Deploy Knowledge Base | ||
|     runs-on: ubuntu-latest | ||
|     needs: [embedding-dry-run] | ||
|     if: github.event_name == 'push' && github.ref == 'refs/heads/main' | ||
|     permissions: | ||
|       id-token: write | ||
|       contents: read | ||
| | ||
|     steps: | ||
|       - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4 | ||
| | ||
|       - name: Enable Corepack | ||
|         run: corepack enable | ||
| | ||
|       - uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # v4 | ||
|         with: | ||
|           node-version: '20' | ||
|           cache: 'yarn' | ||
|           cache-dependency-path: | | ||
|             docs-ingestion/yarn.lock | ||
|             docs-embeddings/yarn.lock | ||
| | ||
|       - name: Configure AWS Credentials | ||
|         uses: aws-actions/configure-aws-credentials@ff717079ee2060e4bcee96c4779b553acc87447c # v4 | ||
|         with: | ||
|           role-to-assume: ${{ secrets.AWS_ROLE_ARN }} | ||
|           aws-region: ap-south-1 | ||
| | ||
|       # ── Build ingestion pipeline ── | ||
|       - name: Install ingestion dependencies | ||
|         working-directory: docs-ingestion | ||
|         run: yarn install --immutable | ||
| | ||
|       - name: Build ingestion | ||
|         working-directory: docs-ingestion | ||
|         run: yarn build | ||
| | ||
|       - name: Normalize API specs | ||
|         working-directory: docs-ingestion | ||
|         run: yarn normalize-api-specs | ||
| | ||
|       - name: Normalize MDX | ||
|         working-directory: docs-ingestion | ||
|         run: yarn normalize-mdx | ||
| | ||
|       - name: Run ingestion pipeline | ||
|         working-directory: docs-ingestion | ||
|         run: node dist/index.js | ||
| | ||
|       # ── Upload content to S3 ── | ||
|       - name: Upload content to S3 | ||
|         working-directory: docs-ingestion | ||
|         env: | ||
|           CONTENT_BUCKET_NAME: ${{ secrets.CONTENT_BUCKET_NAME }} | ||
|         run: node dist/upload-content.js | ||
| | ||
|       # ── Build and run embedding sync ── | ||
|       - name: Install embedding dependencies | ||
|         working-directory: docs-embeddings | ||
|         run: yarn install --immutable | ||
| | ||
|       - name: Build embeddings | ||
|         working-directory: docs-embeddings | ||
|         run: yarn build | ||
| | ||
|       - name: Run embedding sync | ||
|         working-directory: docs-embeddings | ||
|         env: | ||
|           PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }} | ||
|           PINECONE_INDEX: ${{ secrets.PINECONE_INDEX }} | ||
|           CONTENT_BUCKET_NAME: ${{ secrets.CONTENT_BUCKET_NAME }} | ||
|           INGESTION_OUTPUT_PATH: ${{ github.workspace }}/docs-ingestion/output/chunks.json | ||
|         run: node dist/index.js | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| # Normalized MDX output (generated by docs-ingestion/normalize-mdx) | ||
| .docs-normalized/ | ||
| | ||
| # Normalized API spec output (generated by docs-ingestion/normalize-api-specs) | ||
| .api-reference-normalized/ | ||
| | ||
| # Ruflo | ||
| .ruflo/ | ||
| | ||
| # Claude Code | ||
| .claude-flow/ | ||
| .mcp.json | ||
| .claude | ||
| .swarm | ||
No argument for the file path for ./docs-normalized. Is that intentional?