AI-powered credit appraisal engine for Indian corporate lending. Built for the IIT Hyderabad x Vivriti Capital hackathon (March 2026).
The system takes uploaded financial documents (annual reports, GST filings, bank statements, ALM data, shareholding patterns, borrowing profiles, portfolio reports), runs autonomous web research on the company and its promoters, scores credit risk through a three-layer hybrid engine, and produces a full Credit Appraisal Memo (CAM) in Word format. A credit officer portal lets the human review, override, and interrogate the AI's reasoning at every stage.
The core design principle is: the LLM explains, deterministic rules govern, and ML calibrates. No black boxes. Every decision traces back to a specific data source or rule.
- How It Works
- Architecture
- Repository Layout
- Tech Stack
- Getting Started
- Docker Compose
- Environment Variables
- API Reference
- Database Schema
- Document Ingestion
- Research Agent
- Risk Scoring Engine
- Report Generation
- Frontend Portal
- ML Training
- Testing
- Project Context
A credit officer starts an assessment by creating a company profile, optionally attaching a loan application, and uploading documents. The system then runs a multi-stage pipeline:
- Document Classification -- Each uploaded file is auto-classified (ALM statement, shareholding pattern, borrowing profile, annual report, portfolio quality report, GST filing, bank statement, ITR, etc.) using filename heuristics with LLM fallback. The officer can review and override classifications before analysis.
- Structured Extraction -- Type-specific parsers pull structured data from each document. Digital PDFs go through pdfplumber/PyMuPDF. Scanned or degraded PDFs are handled by Qwen2.5-VL (vision-language model via Hugging Face).
- Cross-Validation -- GST declared revenue is compared against bank credits. ITC claims are checked against supplier filings. Circular trading patterns are detected through graph analysis of buyer-seller relationships.
- Web Research -- An autonomous agent searches for the company, its directors, and its sector across news sources, MCA (Ministry of Corporate Affairs), eCourts, and regulatory databases. Findings are extracted, scored for severity, and stored.
- SWOT Analysis -- A two-step LLM process generates a structured SWOT (Strengths, Weaknesses, Opportunities, Threats) with sector outlook and macro signals based on all gathered data.
- Risk Scoring -- Three layers produce the final score:
  - Layer 1: Deterministic rules engine (DSCR thresholds, leverage caps, DIN disqualification checks, circular trading flags). Critical rules are hard stops.
  - Layer 2: An XGBoost model calibrates the score by capturing nonlinear feature interactions.
  - Layer 3: Blending formula (60% rules, 40% ML) produces the final 0-100 score.
- Decision -- Based on the blended score: APPROVE (75-100), CONDITIONAL APPROVE (35-74), or REJECT (0-34). Loan limits follow RBI Tandon Committee MPBF norms. Interest premiums are tiered by risk category.
- CAM Generation -- A professional Credit Appraisal Memo is generated as a Word document (.docx), structured around the Five Cs of Credit (Character, Capacity, Capital, Collateral, Conditions). An investment-grade report is also available.
- Chat -- The credit officer can ask natural language questions about the assessment. Answers are grounded in the actual documents and research findings via RAG over Qdrant.
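
The cross-validation step flags circular trading through graph analysis of buyer-seller relationships. A minimal sketch of that idea follows, assuming invoices have already been reduced to (seller, buyer) pairs. The function name `find_trading_cycle` is hypothetical, and a plain depth-first search stands in here for whatever NetworkX routine the project actually uses.

```python
from collections import defaultdict


def find_trading_cycle(edges):
    """Return one directed cycle of parties as a list, or None if there is none.

    `edges` is an iterable of (seller, buyer) pairs. A cycle such as
    A -> B -> C -> A means invoices flow back to the originating party.
    """
    graph = defaultdict(list)
    for seller, buyer in edges:
        graph[seller].append(buyer)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    colour = defaultdict(int)
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if colour[nxt] == GREY:  # back edge: nxt is already on the path
                return path[path.index(nxt):]
            if colour[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        colour[node] = BLACK
        path.pop()
        return None

    for start in list(graph):
        if colour[start] == WHITE:
            cycle = visit(start)
            if cycle:
                return cycle
    return None
```

Calling `find_trading_cycle([("A", "B"), ("B", "C"), ("C", "A")])` returns the three parties on the loop, while an acyclic supply chain returns `None`. The recursive search is fine for small invoice graphs; a very large graph would need an iterative version.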
The entire pipeline streams progress back to the frontend via SSE, so the officer sees exactly what the system is doing at each step.
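
Server-Sent Events messages are blocks of `field: value` lines terminated by a blank line. As a rough sketch of how a pipeline step update could be serialised for that stream, assuming a JSON payload: the event name `progress` and the payload fields `step` and `percent` are illustrative assumptions, not the backend's actual schema.

```python
import json


def format_sse(data, event=None):
    """Serialise one Server-Sent Events message."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    # JSON encoding keeps the payload on a single `data:` line.
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"


def progress_events(steps):
    """Yield one SSE message per pipeline step with a running percentage."""
    total = len(steps)
    for i, step in enumerate(steps, start=1):
        payload = {"step": step, "percent": round(100 * i / total)}
        yield format_sse(payload, event="progress")
```

A generator like `progress_events` could be handed to a streaming HTTP response with the `text/event-stream` content type, and the frontend would receive one message per completed step.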

                          NEXT.JS FRONTEND (port 3000)
                  /start  /upload  /classify  /notes  /pipeline
                  /score  /results  /explain  /chat  /cam
                                        |
                                        |  REST + SSE
                                        v
                          FASTAPI BACKEND (port 8000)
                               /api/v1/* endpoints
                                        |
                      +-----------------+-----------------+
                      |                 |                 |
              PIPELINE SERVICE     POSTGRESQL          QDRANT
               (orchestrator)      (all state)     (vector search)
                      |
         +------------+------------+------------+
         |            |            |            |
     INGESTION    RESEARCH      SCORING      REPORT
    14 parsers    web agent    3 layers    CAM + SWOT
    3 OCR tiers  MCA/eCourts   rules+ML investment report
                 news/sector SHAP explain .docx output
                      |
            REDIS       CELERY WORKER
          (broker)   (background tasks)
                      |
           DELTA LAKE / DATABRICKS
            (optional data sink)

backend/
main.py FastAPI app, middleware, router registration
config.py All env vars via pydantic-settings
database.py Async SQLAlchemy engine + session factory
models/
db_models.py 12 ORM models (companies, documents, risk_scores, ...)
schemas/ Pydantic request/response schemas
api/routes/
upload.py POST /companies, /documents (auto-classifies on upload)
analysis.py POST /analyze, GET /status (SSE), /results, /explain
GET /swot, GET /investment-report
loan.py POST/GET/PATCH /loan-applications
classification.py GET/PATCH document classifications (HITL review)
due_diligence.py POST /dd-input, GET /dd-preview
research.py GET /research findings
report.py GET /report (.docx), /report/pdf
chat.py POST /chat (RAG over documents + CAM)
health.py GET /health, /health/integrations
core/
pipeline_service.py Main orchestrator -- runs the full analysis pipeline
ingestion/
document_classifier.py Auto-classification (filename heuristics + LLM)
pdf_parser.py PyMuPDF + pdfplumber + Qwen2.5-VL OCR
qwen_vl_ocr.py Qwen2.5-VL vision-language OCR (Hugging Face)
alm_parser.py Asset-Liability Maturity parser
shareholding_parser.py Promoter/FII/DII/pledge data parser
borrowing_profile_parser.py Existing debt schedule parser
portfolio_parser.py AUM/NPA/collection efficiency parser
gst_parser.py GST return parser + ITC mismatch detection
bank_statement.py Bank statement anomaly extraction
itr_parser.py Income tax return parser
cross_validator.py Cross-source consistency checks
schema_extractor.py Dynamic field extraction via LLM
xlsx_financial_parser.py Excel financial statement parser
research/
web_agent.py Autonomous multi-source research orchestrator
crawl4ai_client.py Crawl4AI search + crawl client (Serper-backed)
finding_extractor.py LLM-powered finding extraction and scoring
search_strategies.py India-specific search query templates
mca_scraper.py MCA21 portal data
ecourt_scraper.py eCourts / DRT litigation search
news_scraper.py News intelligence
due_diligence_ai.py AI analysis of credit officer notes
swot_engine.py Two-step SWOT generation
research_to_delta.py CAM narrative synthesis from research
cibil_mock.py Mock CIBIL data for demo
ml/
feature_engineering.py 30+ feature builder from all data sources
credit_scorer.py XGBoost scoring wrapper
risk_rules.py Hard rejection rules (DSCR < 1, negative NW, etc.)
explainer.py SHAP + narrative explainability
llm/
llm_client.py Unified LLM client (Cerebras gpt-oss-120b only)
report/
cam_generator.py 9-section CAM Word document generator
five_c_analyzer.py Five Cs of Credit analysis
investment_report_generator.py Investment-grade SWOT report (.docx)
state_store.py In-memory pipeline state management
structured_logging.py JSON structured logging
india_context.py Indian banking terminology + sector data
scoring/
rules_engine.py Layer 1: deterministic rules (17 rules, 5 critical)
ml_calibrator.py Layer 2: XGBoost probability calibration
score_blender.py Layer 3: 60/40 blending + decision thresholds
shap_explainer.py SHAP feature importance computation
vector_store/
qdrant_client.py Qdrant collection init + upsert + semantic search
databricks/ Spark/Delta Lake integration (optional)
tasks/ Celery background task definitions
cam/ CAM template definitions
frontend/
app/
page.tsx Landing page
about/page.tsx About / architecture page
app/
start/page.tsx 3-step onboarding (entity + loan + review)
upload/page.tsx Document upload with drag-and-drop
classify/page.tsx Document classification review (HITL)
notes/page.tsx Credit officer due diligence notes
pipeline/page.tsx Live pipeline progress (SSE stream)
score/page.tsx Risk score summary + SHAP chart
results/page.tsx Full dashboard (Five Cs, anomalies, SWOT, research)
explain/page.tsx Decision explainability narrative
chat/page.tsx Chat with the CAM / ask questions
cam/page.tsx CAM preview + download
components/
UploadZone.tsx Drag-and-drop file upload
AgentProgressLog.tsx SSE pipeline progress display
ShapChart.tsx SHAP feature importance bar chart (Recharts)
RiskGauge.tsx Circular risk score meter
FiveCsRadar.tsx Five Cs radar chart
SwotMatrix.tsx 2x2 SWOT grid with evidence citations
TimelineView.tsx Analysis timeline
ResearchFeed.tsx Research findings feed
StressTestPanel.tsx Stress test scenarios
FraudGraph.tsx Circular trading graph (D3)
AnomalyFlags.tsx Anomaly/flag display
CamPreview.tsx CAM document preview
ChatInterface.tsx Chat UI
DocumentViewer.tsx Document content viewer
lib/api.ts API client (auto-discovers backend port)
store/analysisStore.ts Zustand persisted state
ml/
train_model.py XGBoost training script
features.py Feature engineering for training
synthetic_data.py Synthetic Indian corporate credit data generator
model/ Trained model artifacts (.joblib)
data/sample/ Sample documents for demo
scripts/
generate_sample_data.py Creates sample documents for testing
warm_cache.py Pre-warms research cache
tests/backend/ Backend unit tests
Backend: Python 3.11, FastAPI, SQLAlchemy (async), Pydantic v2, Celery, Prefect
Frontend: Next.js 14 (App Router), React 18, TypeScript, Tailwind CSS, Recharts, D3, Zustand
Database: PostgreSQL 16 (primary store), Qdrant (vector search), Redis (task broker)
ML: XGBoost, scikit-learn, SHAP, pandas, NetworkX (graph fraud detection)
OCR: PyMuPDF + pdfplumber (digital PDFs), Qwen2.5-VL via Hugging Face (scanned PDFs)
LLM: Cerebras gpt-oss-120b only
Research: Crawl4AI (Serper-backed search) + MCA/eCourts scrapers + news intelligence
Reports: python-docx (Word generation)
Infrastructure: Docker Compose, optional Databricks/Delta Lake sink
Prerequisites: Python 3.11+, Node.js 18+, PostgreSQL 16, Docker (optional).
git clone <repo-url>
cd Intelli-Credit
cp .env.example .env
# Edit .env with your API keys (CEREBRAS_API_KEY required; HUGGINGFACE_API_TOKEN needed for OCR)
You can either use Docker for Postgres + Qdrant + Redis, or run them locally:
# Option A: standalone containers
docker run -d --name intelli_postgres -e POSTGRES_DB=intellicredit -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=password -p 5432:5432 postgres:16
docker run -d --name intelli_qdrant -p 6333:6333 qdrant/qdrant:latest
docker run -d --name intelli_redis -p 6379:6379 redis:7-alpine
# Option B: docker compose (starts everything including backend + frontend)
docker compose up --build
python -m venv venv
# Linux/Mac:
source venv/bin/activate
# Windows:
venv\Scripts\activate
pip install -r requirements.txt
python -m uvicorn backend.main:app --host 0.0.0.0 --port 8000
The backend auto-creates all database tables on startup.
cd frontend
npm install
npm run dev
- Frontend: http://localhost:3000
- API docs (Swagger): http://localhost:8000/docs
- Health check: http://localhost:8000/health
python scripts/generate_sample_data.py
docker compose up --build
Runs six services:
| Service | Port | Purpose |
|---|---|---|
| postgres | 5432 | Primary database |
| qdrant | 6333 | Vector store for document/research chunks |
| redis | 6379 | Celery task broker |
| backend | 8001 | FastAPI application (mapped to internal 8000) |
| worker | -- | Celery background worker |
| frontend | 3000 | Next.js credit officer portal |
All services have health checks. Backend waits for Postgres, Qdrant, and Redis to be ready before starting.
Copy .env.example to .env. The important ones:
Required (LLM):
- CEREBRAS_API_KEY -- Cerebras gpt-oss-120b (only LLM provider)
Database:
- DATABASE_URL -- Postgres connection string (default: local)
- QDRANT_URL -- Qdrant endpoint (default: localhost:6333)
- REDIS_URL -- Redis endpoint (default: localhost:6379)
Research (for live mode):
- SERPER_API_KEY -- Google search for Crawl4AI
- RESEARCH_MODE -- mock (default, deterministic) or live (real web search)
OCR:
- HUGGINGFACE_API_TOKEN -- Hugging Face token for Qwen2.5-VL OCR
- QWEN_VL_MODEL -- Vision model for scanned PDFs (default: Qwen2.5-VL-7B-Instruct)
Optional:
- DATABRICKS_HOST, DATABRICKS_TOKEN -- Delta Lake sink
Full list with defaults is in backend/config.py.
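
The repository layout says backend/config.py loads these through pydantic-settings. As a dependency-free illustration of the same fail-fast idea, the sketch below reads a few of the variables with the standard library only. The `Settings` class, the `load_settings` helper, and the URL forms of the defaults are assumptions for illustration, not the project's code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    cerebras_api_key: str
    research_mode: str = "mock"
    qdrant_url: str = "http://localhost:6333"  # assumed URL form of localhost:6333
    redis_url: str = "redis://localhost:6379"  # assumed URL form of localhost:6379


def load_settings(env=None):
    """Build Settings from a mapping (os.environ by default), failing fast on bad values."""
    env = os.environ if env is None else env
    key = env.get("CEREBRAS_API_KEY")
    if not key:
        raise ValueError("CEREBRAS_API_KEY is required")
    mode = env.get("RESEARCH_MODE", "mock")
    if mode not in ("mock", "live"):
        raise ValueError(f"RESEARCH_MODE must be 'mock' or 'live', got {mode!r}")
    return Settings(
        cerebras_api_key=key,
        research_mode=mode,
        qdrant_url=env.get("QDRANT_URL", Settings.qdrant_url),
        redis_url=env.get("REDIS_URL", Settings.redis_url),
    )
```

Passing a plain dict instead of relying on `os.environ` keeps the loader easy to test, for example `load_settings({"CEREBRAS_API_KEY": "test", "RESEARCH_MODE": "live"})`.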
All endpoints are under /api/v1. Every response follows this envelope:
{
  "status": "success",
  "data": { },
  "meta": {
    "request_id": "uuid",
    "timestamp": "2026-03-11T12:00:00Z",
    "processing_time_ms": 142
  }
}

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/companies | Create company profile |
| POST | /api/v1/companies/{id}/documents | Upload documents (auto-classified) |
| GET | /api/v1/companies/{id}/classifications | List document classifications |
| PATCH | /api/v1/companies/{id}/classifications/{cls_id} | Approve/override classification |
| POST | /api/v1/companies/{id}/classifications/{cls_id}/extract | Trigger extraction for a doc |

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/companies/{id}/loan-applications | Create loan application |
| GET | /api/v1/companies/{id}/loan-applications | List loan applications |
| PATCH | /api/v1/loan-applications/{loan_id} | Update loan application |

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/companies/{id}/analyze | Start full analysis pipeline |
| GET | /api/v1/companies/{id}/status | SSE stream of pipeline progress |
| GET | /api/v1/companies/{id}/results | Full analysis results |
| GET | /api/v1/companies/{id}/explain | Decision explainability |
| GET | /api/v1/companies/{id}/swot | SWOT analysis data |

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/v1/companies/{id}/dd-input | Submit credit officer notes |
| GET | /api/v1/companies/{id}/dd-preview | AI preview of DD notes |

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/companies/{id}/research | Research findings |
| GET | /api/v1/companies/{id}/report | Download CAM (.docx) |
| GET | /api/v1/companies/{id}/report/pdf | Download CAM (.pdf) |
| GET | /api/v1/companies/{id}/investment-report | Download investment report (.docx) |
| POST | /api/v1/companies/{id}/chat | RAG chat over documents + CAM |

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/health | Health check |
| GET | /api/v1/health/integrations | Integration status (DB, Qdrant, LLM) |
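
Every endpoint in this section wraps its payload in the same status/data/meta envelope. A small sketch of how such an envelope could be assembled, using only the standard library: the helper name `make_envelope` and its `started_at` parameter are hypothetical, and the backend may build the envelope differently (for example in middleware).

```python
import time
import uuid
from datetime import datetime, timezone


def make_envelope(data, started_at, status="success"):
    """Wrap a payload in the status/data/meta envelope.

    `started_at` is a time.perf_counter() reading taken when the request began.
    """
    elapsed_ms = int((time.perf_counter() - started_at) * 1000)
    return {
        "status": status,
        "data": data,
        "meta": {
            "request_id": str(uuid.uuid4()),
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "processing_time_ms": elapsed_ms,
        },
    }
```

A handler would record `started_at = time.perf_counter()` on entry and return `make_envelope(payload, started_at)` on exit, so clients can always read the payload from `data` and the tracing fields from `meta`.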
PostgreSQL with 12 tables:
| Table | Purpose |
|---|---|
| companies | Company profiles (name, CIN, GSTIN, sector, PAN, turnover, etc.) |
| documents | Uploaded files with extracted text and structured data (JSONB) |
| document_classifications | Auto-classification results + human review overrides |
| loan_applications | Loan type, amount, tenure, collateral, status |
| analysis_runs | Pipeline execution tracking (status, progress %, audit log) |
| risk_scores | Scoring output (rule score, ML probability, final score, SHAP values, decision) |
| swot_analyses | SWOT data (strengths/weaknesses/opportunities/threats + sector outlook) |
| cam_outputs | Generated CAM text and file paths |
| qualitative_inputs | Credit officer notes, site visit data, management assessment |
| research_finding_records | Normalized research findings with severity and confidence |
| due_diligence_records | DD inputs and AI-parsed insights |
| chat_history | Conversation history for RAG chat |

Tables are created automatically on startup via SQLAlchemy create_all. For production, use Alembic migrations.
- ALM Statement -- Asset-liability maturity buckets, liquidity gaps
- Shareholding Pattern -- Promoter, FII, DII, public holdings, pledge percentages
- Borrowing Profile -- Existing debt schedule, lenders, rates, maturities
- Portfolio Quality Report -- AUM, GNPA, NNPA, collection efficiency, provision coverage
- Annual Report -- Revenue, EBITDA, PAT, balance sheet, auditor opinion, directors
- GST Filing -- GSTR-3B/1 data, outward supplies, ITC claimed
- Bank Statement -- Transaction patterns, bounced cheques, EMI debits
- ITR -- Income tax return data
- General / Unknown -- Handled by the dynamic schema extractor
Two tiers, used in order:
- pdfplumber + PyMuPDF -- Fast, accurate for born-digital PDFs
- Qwen2.5-VL -- Vision-language model via Hugging Face Inference API for scanned PDFs
On upload, each document is automatically classified:
- Filename pattern matching (e.g., "alm" in filename -> ALM_STATEMENT)
- First-page text keyword scoring
- LLM fallback for ambiguous documents
The credit officer reviews classifications on the /classify page and can override before analysis runs.
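
The three classification stages above can be sketched as a simple cascade. Only the "alm" to ALM_STATEMENT mapping comes from this README; the other labels, the keyword lists, the `min_hits` threshold, and the `llm_fallback` callable are assumptions made for illustration.

```python
# Filename substrings mapped to document types.
FILENAME_HINTS = {
    "alm": "ALM_STATEMENT",                  # mapping stated in the README
    "shareholding": "SHAREHOLDING_PATTERN",  # remaining labels are assumed names
    "gstr": "GST_FILING",
    "bank": "BANK_STATEMENT",
}

# Hypothetical first-page keywords per document type.
TEXT_KEYWORDS = {
    "ALM_STATEMENT": ("maturity bucket", "liquidity gap"),
    "GST_FILING": ("gstr-3b", "input tax credit"),
    "BANK_STATEMENT": ("opening balance", "closing balance"),
}


def classify(filename, first_page_text, llm_fallback=None, min_hits=2):
    """Return (label, method): filename match, then keyword score, then optional LLM."""
    name = filename.lower()
    for hint, label in FILENAME_HINTS.items():
        if hint in name:
            return label, "filename"

    text = first_page_text.lower()
    scores = {
        label: sum(kw in text for kw in keywords)
        for label, keywords in TEXT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] >= min_hits:
        return best, "keywords"

    if llm_fallback is not None:
        return llm_fallback(first_page_text), "llm"
    return "UNKNOWN", "none"
```

Returning the method alongside the label lets the review page show the officer why a document was classified the way it was. Plain substring matching is deliberately crude here (for instance, "alm" also matches "palm"), which is one reason a human review step is useful.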
The web research module runs autonomously during the analysis pipeline. It searches for:
- Company news (fraud, defaults, NPA, regulatory actions)
- MCA21 data (CIN, directors, DIN disqualification, registered charges)
- eCourts / DRT litigation (pending cases, recovery suits)
- Promoter background (fraud news, wilful defaulter lists)
- Sector conditions (RBI circulars, regulatory headwinds)
Research mode is controlled by RESEARCH_MODE:
- mock -- Returns deterministic cached results (safe for development)
- live -- Real web searches via Crawl4AI + scraping
Findings are scored by severity (critical/high/medium/low), stored in PostgreSQL, and embedded in Qdrant for RAG queries.
17 rules encoding Indian banking underwriting norms. Examples:
| Rule | Severity | Penalty |
|---|---|---|
| DSCR below 1.0 | CRITICAL | -40 |
| Negative net worth | CRITICAL | -50 |
| Circular trading detected | CRITICAL | -60 |
| Director DIN disqualified | CRITICAL | -50 |
| Active recovery suit / DRT case | CRITICAL | -40 |
| Debt-to-Equity above 3x | HIGH | -20 |
| Current ratio below 1.0 | HIGH | -15 |
| GST-bank mismatch above 25% | HIGH | -25 |
| Qualified auditor opinion | HIGH | -20 |
| 3+ active litigations | HIGH | -20 |
| Revenue declining 3 years | MEDIUM | -10 |
| Factory capacity below 50% | MEDIUM | -15 |
Base score is 100. Rules subtract. Critical rules are hard stops that force REJECT regardless of ML output.
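
A compact sketch of how penalty rules like these can be evaluated follows. It covers only a subset of the table; the `Rule` class, the feature dictionary keys, and the choice to floor the score at 0 are assumptions, not the contents of rules_engine.py.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    severity: str                     # "CRITICAL", "HIGH" or "MEDIUM"
    penalty: int                      # points subtracted from the base score
    triggered: Callable[[dict], bool]


# A subset of the rules table; the dictionary keys are assumed field names.
RULES = [
    Rule("DSCR below 1.0", "CRITICAL", 40, lambda f: f["dscr"] < 1.0),
    Rule("Negative net worth", "CRITICAL", 50, lambda f: f["net_worth"] < 0),
    Rule("Debt-to-Equity above 3x", "HIGH", 20, lambda f: f["debt_to_equity"] > 3.0),
    Rule("Current ratio below 1.0", "HIGH", 15, lambda f: f["current_ratio"] < 1.0),
    Rule("GST-bank mismatch above 25%", "HIGH", 25, lambda f: f["gst_bank_mismatch_pct"] > 25),
]


def evaluate_rules(features, rules=RULES, base=100):
    """Return (rule_score, hard_stop, names_of_triggered_rules)."""
    hits = [r for r in rules if r.triggered(features)]
    score = max(0, base - sum(r.penalty for r in hits))  # flooring at 0 is an assumption
    hard_stop = any(r.severity == "CRITICAL" for r in hits)
    return score, hard_stop, [r.name for r in hits]
```

For a borrower with DSCR 1.4, positive net worth, debt-to-equity of 3.5, a current ratio of 0.9 and a 10% GST-bank mismatch, two HIGH rules fire and the rule score is 100 - 20 - 15 = 65 with no hard stop. Returning the triggered rule names keeps every deduction traceable to a specific rule.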
A trained model takes 30+ features and outputs a stress probability (0 to 1). Features include DSCR, EBITDA margin, leverage, GST mismatch %, litigation count, promoter flags, bounced cheques, sector risk index, ALM liquidity gaps, promoter holding %, pledged shares %, existing debt load, GNPA %, and collection efficiency.
Final Score = 0.6 x Rule Score + 0.4 x (1 - ML Stress Probability) x 100
| Score Range | Category | Decision |
|---|---|---|
| 75 - 100 | LOW risk | APPROVE |
| 55 - 74 | MODERATE risk | CONDITIONAL APPROVE |
| 35 - 54 | HIGH risk | CONDITIONAL APPROVE with enhanced collateral |
| 0 - 34 | CRITICAL risk | REJECT |
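
The blending formula and decision bands translate directly into code. This is a sketch rather than the contents of score_blender.py: the function names are hypothetical, and treating fractional scores with `>=` thresholds (so 74.5 counts as MODERATE) is an assumption, since the table only lists whole-number ranges.

```python
def blend_score(rule_score, ml_stress_probability):
    """Final Score = 0.6 x Rule Score + 0.4 x (1 - ML Stress Probability) x 100."""
    if not 0.0 <= ml_stress_probability <= 1.0:
        raise ValueError("ml_stress_probability must be in [0, 1]")
    return 0.6 * rule_score + 0.4 * (1.0 - ml_stress_probability) * 100.0


def decide(final_score, hard_stop=False):
    """Map a 0-100 score to (risk_category, decision) using the bands in the table."""
    if final_score >= 75:
        category, decision = "LOW", "APPROVE"
    elif final_score >= 55:
        category, decision = "MODERATE", "CONDITIONAL APPROVE"
    elif final_score >= 35:
        category, decision = "HIGH", "CONDITIONAL APPROVE with enhanced collateral"
    else:
        category, decision = "CRITICAL", "REJECT"
    if hard_stop:  # a triggered critical rule overrides the blended score
        decision = "REJECT"
    return category, decision
```

As a worked example, a rule score of 65 with a stress probability of 0.2 gives 0.6 x 65 + 0.4 x 0.8 x 100 = 39 + 32 = 71, which falls in the MODERATE band and yields CONDITIONAL APPROVE.
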
Loan limits follow RBI Tandon Committee MPBF norms:
MPBF = 0.75 x (Current Assets - Current Liabilities)
Recommended Limit = MPBF x (Score / 100)
Interest premium over MCLR: LOW +50bps, MODERATE +150bps, HIGH +300bps.
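
The limit and pricing arithmetic above can be sketched as follows. The function names are hypothetical, and clamping a negative MPBF to zero before scaling is an assumption about how negative working capital would be handled.

```python
# Premium over MCLR in basis points, per risk category (from the line above).
PREMIUM_BPS = {"LOW": 50, "MODERATE": 150, "HIGH": 300}


def mpbf(current_assets, current_liabilities):
    """MPBF = 0.75 x (Current Assets - Current Liabilities)."""
    return 0.75 * (current_assets - current_liabilities)


def recommended_limit(current_assets, current_liabilities, score):
    """Recommended Limit = MPBF x (Score / 100), never below zero (assumption)."""
    return max(0.0, mpbf(current_assets, current_liabilities)) * score / 100


def interest_premium_bps(category):
    """Return the premium for a category, or None where no premium is listed."""
    return PREMIUM_BPS.get(category)
```

For current assets of 40 and current liabilities of 20 (in the same unit, for example crore rupees), MPBF is 0.75 x 20 = 15, and a final score of 71 gives a recommended limit of 15 x 0.71 = 10.65. No premium is listed for the CRITICAL category, since those applications are rejected.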
SHAP values are computed for every prediction, showing exactly how much each feature pushed the score up or down. The frontend renders this as a horizontal bar chart. A narrative explainability layer translates the numbers into plain English.
A Word document (.docx) structured around the Five Cs of Credit:
- Executive Summary
- Borrower Profile
- Character (promoter integrity, track record, MCA/court findings)
- Capacity (cash flows, DSCR, repayment ability)
- Capital (net worth, leverage, balance sheet strength)
- Collateral (assets pledged, security coverage)
- Conditions (sector outlook, macro factors, RBI regulations)
- Risk Assessment and Red Flags
- Recommendation (decision, limit, pricing, covenants or rejection rationale)
The LLM writes the prose. All numbers, scores, and decisions come from the scoring engine, not the LLM. The LLM is given structured data and told to explain, not decide.
A separate SWOT-focused report with 9 sections: company overview, SWOT matrix, financial highlights, risk assessment, sector analysis, key metrics, research findings, recommendation, and disclaimer.
The credit officer portal is a Next.js application with these pages:
Onboarding:
- /app/start -- Three-step form: entity profile, loan details, review and submit
- /app/upload -- Drag-and-drop document upload (PDF, DOCX, CSV, XLS, images)
- /app/classify -- Review auto-classifications, approve or override document types
- /app/notes -- Credit officer due diligence notes with real-time AI preview
Analysis:
- /app/pipeline -- Live pipeline progress via SSE (shows each step as it runs)
- /app/score -- Risk score summary with SHAP feature importance chart
- /app/results -- Full dashboard: Five Cs radar, anomaly flags, SWOT matrix, research feed, stress test panel, fraud graph, investment report download
- /app/explain -- Decision explainability narrative with Indian banking glossary
Output:
- /app/cam -- CAM document preview and download
- /app/chat -- Ask questions about the assessment (RAG-grounded answers)
The frontend uses a dark theme with the design system classes from the project's Tailwind config. State is managed through Zustand with persistence, so the officer can navigate between pages without losing context.
To retrain the XGBoost model on new data:
# Generate synthetic training data
python ml/synthetic_data.py
# Train the model
python ml/train_model.py
The trained model is saved to ml/model/. The pipeline loads it automatically.
The synthetic data generator creates Indian corporate credit profiles with realistic distributions for DSCR, leverage, margins, sector mix, and stress outcomes.
# Activate your virtual environment first
pytest -q tests/backend
Tests cover the rules engine, GST mismatch detection, circular trading detection, and feature engineering.
Built for the IIT Hyderabad x Vivriti Capital "Intelli-Credit" hackathon. The problem statement asks for three deliverables:
- Data Ingestor -- Multi-format document ingestion with high-latency pipeline support. We handle PDF, DOCX, CSV, XML, XLS, XLSX, JPEG, PNG with a three-tier OCR stack and type-specific parsers for ALM, shareholding, borrowing profile, portfolio, GST, bank statements, and ITR.
- Digital Credit Manager (Research Agent) -- Autonomous web research across MCA, eCourts, news, and regulatory sources. Includes credit officer due diligence input with AI-powered insight extraction. Human-in-the-loop at document classification and before scoring.
- Recommendation Engine -- Explainable three-layer scoring (rules + ML + blending), SHAP-based feature importance, CAM generation in Word format, investment report, and RAG-powered chat.
The system is designed for Indian corporate lending specifically. It uses Indian banking terminology (MPBF, DSCR, TOL/TNW, MCLR), follows RBI Tandon Committee norms for loan limits, and handles scanned documents through Hugging Face Qwen2.5-VL OCR.
For detailed architecture diagrams, see ARCHITECTURE.md.