
Vantage — High-Performance B2B Company Search Directory

A production-grade backend that ingests the Australian Business Register (11.5 GB of XML, 20M+ businesses) and exposes a search API with full-text search (PostgreSQL tsvector + GIN), filtered queries, and an optional ILIKE baseline for performance comparison.

Built with Node.js 24, Express 5, TypeScript 5.9, and PostgreSQL 17.


Highlights

| Capability | Implementation |
| --- | --- |
| ~600 MB XML ingested in ~90 seconds | SAX streaming parser in a dedicated Worker Thread — zero main-thread blocking |
| Sub-50 ms search (optimized path) | PostgreSQL tsvector + GIN index on search_vector; optional native (ILIKE) baseline (~350 ms on 9M rows) |
| Multi-process HTTP | Node.js cluster module forks one worker per CPU core for near-linear throughput scaling |
| Layered architecture | Clear layer separation (Domain, Application, Infrastructure, Interfaces) with Dependency Injection |
| AI-ready | Abstraction layer for plugging in a Text-to-SQL engine (OpenAI, SQLCoder, etc.) |

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Entry Points                             │
│   server.ts (Cluster Primary)     seed.ts (CLI Ingestion)       │
├─────────────────────────────────────────────────────────────────┤
│                     Interfaces Layer                            │
│   Routes → Controllers → Middleware (Auth, Validation, Errors)  │
├─────────────────────────────────────────────────────────────────┤
│                    Application Layer                            │
│   SearchService ← SearchStrategyFactory → Native/Optimized      │
│   IngestionService (Facade) → Worker Thread                     │
├─────────────────────────────────────────────────────────────────┤
│                      Domain Layer                               │
│   Business Entity │ IBusinessRepository │ ISearchStrategy       │
│   IDataSourceAdapter │ ITextToSqlEngine                         │
├─────────────────────────────────────────────────────────────────┤
│                   Infrastructure Layer                          │
│   PostgresBusinessRepository │ searchNative / searchOptimized   │
│   Knex Migrations │   businesses, business_names, trigger + GIN │
├─────────────────────────────────────────────────────────────────┤
│                     Workers Layer                               │
│   ETL Worker Thread → SAX Parser → XmlAdapter → BatchProcessor  │
└─────────────────────────────────────────────────────────────────┘

Design Patterns Used

| Pattern | Where | Why |
| --- | --- | --- |
| Repository | IBusinessRepository / PostgresBusinessRepository | Decouple domain logic from the database engine — swap Postgres for Elasticsearch without touching services |
| Strategy | ISearchStrategy / NativeSearchStrategy / OptimizedSearchStrategy | Swap search technique per request: native (ILIKE) vs optimized (tsvector + GIN); AI via mode (future) |
| Factory | SearchStrategyFactory | Picks a strategy from query.technique and query.mode — one new class + one case per technique |
| Adapter | XmlDataSourceAdapter | Normalize raw ABR XML into domain entities — add a JSON adapter without changing the ETL pipeline |
| Facade | IngestionService | Hide the complexity of worker threads, SAX parsing, and batch processing behind ingest(filePath) |
| Singleton | getDbConnection() | One connection pool per process — in a clustered setup, each worker gets its own pool |
| Dependency Injection | tsyringe + Symbol tokens | Wire everything in one place (container.ts) — swap implementations by changing one line |

Tech Stack

| Category | Technology | Version |
| --- | --- | --- |
| Runtime | Node.js | 24.x LTS |
| Framework | Express | 5.2.1 |
| Language | TypeScript | 5.9.3 (strict mode) |
| Database | PostgreSQL | 17 (via Docker) |
| Query Builder | Knex.js | 3.x |
| Validation | Zod | 4.x |
| Logging | Pino | 10.x (JSON in prod, pretty in dev) |
| XML Parsing | SAX | 1.4.x (streaming, constant memory) |
| DI Container | tsyringe | 4.x |
| Linting | ESLint 10 + typescript-eslint | Flat config |
| Formatting | Prettier | 3.x |

Database & Search

Schema (migrations 001–003)

  • businesses: One row per ABN; columns include entity_name, state, postcode, and a search_vector (TSVECTOR) column. B-tree indexes on abn_status, entity_type_code, state, postcode speed up filters.
  • business_names: 1-to-many alternate names per business; FK to businesses, index on business_id.
  • Migration 003: Trigger maintains search_vector on INSERT/UPDATE (weights: entity_name A, given/family name B, state/postcode C), one-time backfill for existing rows, and a GIN index on businesses(search_vector) so full-text matches use the index.

Two search techniques

| Technique | Query param | How it works | Latency (e.g. 9M rows) |
| --- | --- | --- | --- |
| native | technique=native (default) | entity_name ILIKE '%term%'; no index, sequential scan | ~350 ms |
| optimized | technique=optimized | search_vector @@ to_tsquery('english', …) using the GIN index | Sub-50 ms for selective terms |

Both use the same filters (state, postcode, entityType, abnStatus) and pagination. Filter-only requests (no q) use findWithFilters() for both techniques.

Configurable candidate cap

  • SEARCH_MAX_CANDIDATES (env var, default 5000): Caps the number of candidate rows used for the total count and pagination. You can still request the next page of results; the cap keeps the count and data queries bounded for consistent response times.

Project Structure

src/
├── core/                          # App bootstrap — config, logger, DI container, tokens
│   ├── config.ts                  # Zod-validated environment config (single source of truth)
│   ├── logger.ts                  # Pino structured logger
│   ├── types.ts                   # DI injection tokens (Symbols)
│   └── container.ts               # tsyringe DI wiring
│
├── domain/                        # Pure business logic — no external dependencies
│   ├── entities/
│   │   ├── Business.ts            # Core entity (camelCase) + DB row (snake_case)
│   │   └── BusinessName.ts        # 1-to-many trading/business names
│   └── interfaces/
│       ├── IBusinessRepository.ts # Data access contract
│       ├── ISearchStrategy.ts     # Search algorithm contract
│       ├── IDataSourceAdapter.ts  # Data normalisation contract
│       └── ITextToSqlEngine.ts    # AI search abstraction (future)
│
├── shared/                        # Cross-cutting types, errors, constants
│   ├── types.ts                   # SearchQuery, PaginatedResult, IngestionResult
│   ├── errors/AppError.ts         # Operational vs programmer error hierarchy
│   └── constants.ts               # ABR reference codes, pagination defaults
│
├── infrastructure/                # External service implementations
│   ├── database/
│   │   ├── connection.ts          # Singleton connection pool
│   │   └── migrations/            # Versioned schema changes (001, 002, 003)
│   └── repositories/
│       └── PostgresBusinessRepository.ts  # searchNative (ILIKE), searchOptimized (tsvector @@), findWithFilters
│
├── application/                   # Use cases and orchestration
│   ├── strategies/
│   │   ├── NativeSearchStrategy.ts    # ILIKE baseline
│   │   └── OptimizedSearchStrategy.ts  # tsvector + GIN
│   ├── factories/
│   │   └── SearchStrategyFactory.ts
│   └── services/
│       ├── SearchService.ts       # Search orchestrator
│       └── IngestionService.ts    # ETL facade (spawns worker threads)
│
├── workers/                       # Background processing (separate V8 isolates)
│   └── etl/
│       ├── etlWorker.ts           # SAX streaming XML parser
│       ├── XmlDataSourceAdapter.ts # ABR XML → Business entity transformer
│       └── batchProcessor.ts      # Chunked bulk upsert engine
│
├── interfaces/                    # HTTP layer
│   └── http/
│       ├── app.ts                 # Express app factory
│       ├── middleware/            # Error handler, validation, request logger
│       ├── controllers/          # Thin request/response handlers
│       └── routes/               # URL → controller mapping
│
├── scripts/
│   └── seed.ts                    # CLI data ingestion with progress reporting
│
└── server.ts                      # Clustered entry point (1 worker per CPU core)

Getting Started

Prerequisites

  • Node.js >= 24.x
  • Docker & Docker Compose (for PostgreSQL)

Setup

# 1. Clone and install
git clone https://github.com/devnadeemashraf/vantage-backend.git
cd vantage-backend
npm install

# 2. Configure environment
cp .env.example .env
# Edit .env with your database credentials

# 3. Start PostgreSQL
docker compose up -d

# 4. Run migrations
npm run migrate

# 5. Seed the database (580 MB XML file in ./temp/data)
# NOTE: Download the dataset manually and place it in ./temp/data
npm run seed -- --file ./temp/data/20260211_Public20.xml

# 6. Start the server
npm run dev

Available Scripts

| Command | Description |
| --- | --- |
| npm run dev | Start dev server with hot reload (tsx watch) |
| npm run build | Compile TypeScript + resolve path aliases |
| npm start | Run production build (clustered) |
| npm run migrate | Run pending database migrations |
| npm run migrate:rollback | Roll back the last migration batch |
| npm run seed | Ingest XML data via CLI with progress output |
| npm run seed:migrate | Run migrations then seed in one command |
| npm run lint | Run ESLint on src/ |
| npm run format | Format all TypeScript files with Prettier |

API Endpoints

Search Businesses

GET /api/v1/businesses/search?q=plumbing&state=NSW&page=1&limit=20

Response includes meta with timing for UI display:

{
  "status": "success",
  "data": [...],
  "pagination": { "page": 1, "limit": 20, "total": 123, "totalPages": 7 },
  "meta": {
    "totalTimeMs": 42,
    "queryTimeMs": 28
  }
}
  • totalTimeMs — wall-clock time from request arrival to response sent (ms)
  • queryTimeMs — time spent executing database queries (ms)

Query Parameters:

| Param | Type | Description |
| --- | --- | --- |
| q | string | Search term (matched by ILIKE or full-text depending on technique) |
| state | string | Filter by Australian state (NSW, VIC, QLD, ...) |
| postcode | string | Filter by postcode |
| entityType | string | Filter by entity type code (IND, PRV, PUB, ...) |
| abnStatus | string | Filter by ABN status (ACT, CAN) |
| page | number | Page number (default: 1) |
| limit | number | Results per page (default: 20, max: 100) |
| mode | string | Search mode: standard or ai (future) |
| technique | string | Search technique: native (ILIKE baseline) or optimized (index-backed when available). Default: native. Use to compare query performance. |

Lookup by ABN

GET /api/v1/businesses/12345678901

Response includes meta with timing:

{
  "status": "success",
  "data": { "abn": "12345678901", "entityName": "...", ... },
  "meta": { "totalTimeMs": 15, "queryTimeMs": 8 }
}

Trigger Ingestion

POST /api/v1/ingest
Content-Type: application/json

{ "filePath": "./temp/data/20260211_Public20.xml" }

Health Check

GET /api/v1/health

ETL Pipeline

The ingestion pipeline processes 580 MB XML files (~500K records) in approximately 90 seconds:

XML File (580 MB)
    │
    ▼
FileReadStream (64 KB chunks)
    │
    ▼
SAX Parser (event-driven, constant memory)
    │  opentag → closetag → text events
    ▼
XmlDataSourceAdapter (Adapter Pattern)
    │  Raw XML → Business domain entity
    ▼
BatchProcessor (5000 records/batch)
    │  Chunked INSERT ON CONFLICT MERGE
    │  (respects PG's 65,535 parameter limit)
    ▼
PostgreSQL (with auto-triggered search_vector update)

Key design decisions:

  • Worker Thread isolation: XML parsing runs in a separate V8 isolate so the HTTP server stays responsive during ingestion.
  • SAX streaming: Constant ~50 MB memory usage regardless of file size (vs ~3 GB for DOM parsing).
  • Batch upserts: INSERT ON CONFLICT MERGE makes re-runs idempotent — same file can be ingested twice without duplicates.
  • Parameter chunking: Batches are split into sub-batches of ~4,681 rows to stay under PostgreSQL's 65,535 bind parameter limit.

Scalability Path

The architecture is designed to scale to 20+ ingested XML files (12+ GB total, 16M+ rows):

| Concern | Current | Scale Strategy |
| --- | --- | --- |
| Search latency | GIN on search_vector (optimized technique) | Add pg_partman partitioning; optional pg_trgm for typo tolerance |
| Connection pressure | 10 connections/worker | PgBouncer connection pooler in front of PostgreSQL |
| Read throughput | Cluster module (N workers) | Read replicas + load balancer |
| Write throughput | Chunked upserts | COPY protocol for initial loads, upserts for incremental |
| AI search | Interface stub | Plug in SQLCoder or OpenAI wrapper via ITextToSqlEngine |
| Caching | None (fast enough for prototype) | Redis with cache-aside pattern on hot search queries |
| Monitoring | Pino JSON logs | Prometheus metrics + Grafana dashboards |
