Skip to content

fix: review fixes, knowledge gap reduction, and env var hardening#423

Merged
pulkit004 merged 3 commits intomainfrom
docs-ingestion
Mar 25, 2026
Merged

fix: review fixes, knowledge gap reduction, and env var hardening#423
pulkit004 merged 3 commits intomainfrom
docs-ingestion

Conversation

@pulkit004
Copy link
Copy Markdown
Contributor

Summary

  • Fix critical and important issues from PR feat: add docs-ingestion and docs-embeddings pipelines #422 code review
  • Reduce knowledge gaps: 87.7% → 95.4% embeddable chunks (432 recovered)
  • Remove all hardcoded config defaults, require env vars
  • Add force-embed pattern for critical UMAP content
  • Use GitHub Actions variables (vars.*) instead of hardcoded values in CI

Changes

Code Review Fixes (22eaf1a, fb6a0d8)

  • Fix content.indexOf(paragraph) bug in chunker (always found first occurrence)
  • Fix state-before-upsert ordering in sync.ts (crash could lose chunks)
  • Fix deduplication no-op in deduplication.ts
  • S3 upload retry logic, token recount after cleaning
  • Review suggestions: safeParseJSON warning, pipe-split table guard, API key guard

Knowledge Gap Reduction (f9bb103)

  • Lower MIN_EMBEDDABLE_TOKENS 80→50, raise MAX 1400→1600
  • Cross-section micro-chunk merging with tighter size limit
  • FORCE_EMBED_PATTERNS with isForceEmbeddable() for critical content

Env Var Hardening (d3b95b0)

  • Remove all hardcoded ap-south-1, model IDs, index names from code
  • requireEnv() helper — fail-fast with clear error messages
  • CI workflow uses vars.* for non-sensitive config, secrets.* for sensitive
  • S3 bucket renamed to setu-docs-content-prod in ap-south-1

Required GitHub Actions Variables

Variable Value
AWS_REGION ap-south-1
BEDROCK_MODEL_ID amazon.titan-embed-text-v2:0
PINECONE_INDEX docs-embeddings
BATCH_SIZE 25
EMBEDDING_CONCURRENCY 3

Required Secret Update

Secret Value
CONTENT_BUCKET_NAME setu-docs-content-prod

Test plan

  • yarn build — both packages compile
  • yarn test — 93/93 tests pass
  • yarn dry-run — embedding dry-run validates 5,576 chunks
  • Pipeline: 5,318 embeddable (95.4%), 258 too small, 0 too large
  • GitHub variables set before merge
  • S3 bucket setu-docs-content-prod exists in ap-south-1

All configuration (AWS_REGION, BEDROCK_MODEL_ID, PINECONE_INDEX,
BATCH_SIZE, EMBEDDING_CONCURRENCY) must now be set via environment
variables or .env.local — no more hardcoded ap-south-1 or model ID
defaults. Fail-fast with clear error messages if any are missing.

This fixes the S3 PermanentRedirect error in the Deploy Knowledge Base
CI job, which was caused by the bucket being in us-east-1 while code
defaulted to ap-south-1. Bucket has been recreated as
setu-docs-content-prod in ap-south-1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Checklist to merge a PR 🚀

To merge this pull request, please take time to complete the checklist.

What action did you perform?

Review the corresponding checklist items for the action you performed and mark them done.

Edit an existing content (MDX) page

Checklist

  • Review changes using the MDX preview option
  • If the length of content >15000 chars, use the Content preview portal to view changes
  • If a redirect is needed to the existing page, add a key, value pair in redirects.json

Edit an existing API reference page

Checklist


Add a new content (MDX) page

Checklist

  • Create a .mdx file with the path as its name in the content folder
  • Add frontmatter with all the metadata
  • Review the order of items in Sidebar using the Sidebar preview option
  • Review changes using the MDX preview option
  • If the length of content >15000 chars, use the Content preview portal to view changes
  • Created a folder with the same name, if any children were to be added to the page
  • Once all changes are done, update the menu items by using the Menu Items option
  • Add a key, and value pair in redirects.json if you wish to have a redirect to the new page

Add a new API reference page

Checklist

  • Create a .json file with the product path as its name
  • Create an api-reference.mdx file in the respective product folder inside content folder
  • Add frontmatter with all the metadata
  • Review the order of items in Sidebar using the Sidebar preview option
  • Add API reference in JSON format (OpenAPI or Swagger) into created .json file.
  • Used the Content preview portal to view changes
  • Once all changes are done, update the menu items by using the Menu Items option

@pankaj0308 pankaj0308 self-requested a review March 25, 2026 13:52
@pulkit004 pulkit004 merged commit bb0d7ca into main Mar 25, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants