A production-grade backend that ingests the Australian Business Register (11.5 GB of XML, 20M+ businesses) and exposes a search API with full-text search (PostgreSQL tsvector + GIN), filtered queries, and an optional ILIKE baseline for performance comparison.
Built with Node.js 24, Express 5, TypeScript 5.9, and PostgreSQL 17.
| Capability | Implementation |
|---|---|
| ~580 MB XML ingested in ~90 seconds | SAX streaming parser in a dedicated Worker Thread — zero main-thread blocking |
| Sub-50 ms search (optimized path) | PostgreSQL tsvector + GIN index on `search_vector`; optional ILIKE baseline for comparison (~350 ms on 9M rows) |
| Multi-process HTTP | Node.js cluster module forks one worker per CPU core for near-linear throughput scaling |
| Layered architecture | Clear layer separation (Domain, Application, Infrastructure, Interfaces) with Dependency Injection |
| AI-ready | Abstraction layer for plugging in a Text-to-SQL engine (OpenAI, SQLCoder, etc.) |
```
┌─────────────────────────────────────────────────────────────────┐
│                          Entry Points                           │
│    server.ts (Cluster Primary)       seed.ts (CLI Ingestion)    │
├─────────────────────────────────────────────────────────────────┤
│                        Interfaces Layer                         │
│  Routes → Controllers → Middleware (Auth, Validation, Errors)   │
├─────────────────────────────────────────────────────────────────┤
│                        Application Layer                        │
│  SearchService ← SearchStrategyFactory → Native/Optimized       │
│  IngestionService (Facade) → Worker Thread                      │
├─────────────────────────────────────────────────────────────────┤
│                          Domain Layer                           │
│  Business Entity │ IBusinessRepository │ ISearchStrategy        │
│  IDataSourceAdapter │ ITextToSqlEngine                          │
├─────────────────────────────────────────────────────────────────┤
│                      Infrastructure Layer                       │
│  PostgresBusinessRepository │ searchNative / searchOptimized    │
│  Knex Migrations │ businesses, business_names, trigger + GIN    │
├─────────────────────────────────────────────────────────────────┤
│                          Workers Layer                          │
│  ETL Worker Thread → SAX Parser → XmlAdapter → BatchProcessor   │
└─────────────────────────────────────────────────────────────────┘
```
| Pattern | Where | Why |
|---|---|---|
| Repository | `IBusinessRepository` / `PostgresBusinessRepository` | Decouple domain logic from the database engine — swap Postgres for Elasticsearch without touching services |
| Strategy | `ISearchStrategy` / `NativeSearchStrategy` / `OptimizedSearchStrategy` | Swap the search technique per request: native (ILIKE) vs optimized (tsvector + GIN); AI via `mode` (future) |
| Factory | `SearchStrategyFactory` | Picks a strategy from `query.technique` and `query.mode` — one new class + one case per technique |
| Adapter | `XmlDataSourceAdapter` | Normalize raw ABR XML into domain entities — add a JSON adapter without changing the ETL pipeline |
| Facade | `IngestionService` | Hide the complexity of worker threads, SAX parsing, and batch processing behind `ingest(filePath)` |
| Singleton | `getDbConnection()` | One connection pool per process — in a clustered setup, each worker gets its own pool |
| Dependency Injection | tsyringe + Symbol tokens | Wire everything in one place (`container.ts`) — swap implementations by changing one line |
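The Strategy and Factory rows above can be sketched together. The class and interface names match the table, but the method signatures below are illustrative assumptions, not the project's actual contracts:

```typescript
// Sketch of the Strategy + Factory pairing — signatures are assumed.
interface SearchResult {
  rows: string[];
  technique: string;
}

interface ISearchStrategy {
  search(term: string): Promise<SearchResult>;
}

class NativeSearchStrategy implements ISearchStrategy {
  async search(term: string): Promise<SearchResult> {
    // Would run: entity_name ILIKE '%term%'
    return { rows: [], technique: "native" };
  }
}

class OptimizedSearchStrategy implements ISearchStrategy {
  async search(term: string): Promise<SearchResult> {
    // Would run: search_vector @@ to_tsquery('english', term)
    return { rows: [], technique: "optimized" };
  }
}

class SearchStrategyFactory {
  // One new class + one case per technique, as the table notes.
  create(technique: string): ISearchStrategy {
    switch (technique) {
      case "optimized":
        return new OptimizedSearchStrategy();
      case "native":
      default:
        return new NativeSearchStrategy();
    }
  }
}
```

Adding an AI-backed technique would mean one new class implementing `ISearchStrategy` and one new `case` — no changes to `SearchService`.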
| Category | Technology | Version |
|---|---|---|
| Runtime | Node.js | 24.x LTS |
| Framework | Express | 5.2.1 |
| Language | TypeScript | 5.9.3 (strict mode) |
| Database | PostgreSQL | 17 (via Docker) |
| Query Builder | Knex.js | 3.x |
| Validation | Zod | 4.x |
| Logging | Pino | 10.x (JSON in prod, pretty in dev) |
| XML Parsing | SAX | 1.4.x (streaming, constant memory) |
| DI Container | tsyringe | 4.x |
| Linting | ESLint 10 + typescript-eslint | Flat config |
| Formatting | Prettier | 3.x |
- `businesses`: one row per ABN; columns include `entity_name`, `state`, `postcode`, and a `search_vector` (TSVECTOR) column. B-tree indexes on `abn_status`, `entity_type_code`, `state`, and `postcode` speed up filters.
- `business_names`: 1-to-many alternate names per business; FK to `businesses`, index on `business_id`.
- Migration 003: a trigger maintains `search_vector` on INSERT/UPDATE (weights: entity_name A, given/family name B, state/postcode C), a one-time backfill covers existing rows, and a GIN index on `businesses(search_vector)` lets full-text matches use the index.
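Migration 003 plausibly looks like the following Knex-style migration. The weight assignments come from the description above, but the exact SQL and column names are assumptions (a minimal `Raw` type stands in for Knex so the sketch is self-contained):

```typescript
// Hypothetical reconstruction of Migration 003 — SQL and column names
// are assumptions based on the schema description, not the real file.
type Raw = { raw: (sql: string) => Promise<unknown> }; // minimal Knex stand-in

const MIGRATION_003_SQL = `
  CREATE OR REPLACE FUNCTION businesses_search_vector_update() RETURNS trigger AS $$
  BEGIN
    NEW.search_vector :=
      setweight(to_tsvector('english', coalesce(NEW.entity_name, '')), 'A') ||
      setweight(to_tsvector('english',
        coalesce(NEW.given_name, '') || ' ' || coalesce(NEW.family_name, '')), 'B') ||
      setweight(to_tsvector('english',
        coalesce(NEW.state, '') || ' ' || coalesce(NEW.postcode, '')), 'C');
    RETURN NEW;
  END $$ LANGUAGE plpgsql;

  CREATE TRIGGER trg_businesses_search_vector
    BEFORE INSERT OR UPDATE ON businesses
    FOR EACH ROW EXECUTE FUNCTION businesses_search_vector_update();

  -- One-time backfill: a no-op UPDATE fires the trigger for existing rows.
  UPDATE businesses SET entity_name = entity_name;

  CREATE INDEX idx_businesses_search_vector ON businesses USING GIN (search_vector);
`;

export async function up(db: Raw): Promise<void> {
  await db.raw(MIGRATION_003_SQL);
}
```

Keeping the vector maintenance in a trigger means the ETL worker never has to compute tsvectors itself — any INSERT path gets indexed text for free.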
| Technique | Query param | How it works | Latency (e.g. 9M rows) |
|---|---|---|---|
| native | `technique=native` (default) | `entity_name ILIKE '%term%'`; no index, sequential scan | ~350 ms |
| optimized | `technique=optimized` | `search_vector @@ to_tsquery('english', …)` using the GIN index | Sub-50 ms for selective terms |
Both use the same filters (`state`, `postcode`, `entityType`, `abnStatus`) and pagination. Filter-only requests (no `q`) use `findWithFilters()` for both techniques.
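One subtlety of the optimized path: `to_tsquery` rejects raw user input containing spaces or operators, so `q` must be converted into tsquery syntax first. A hedged sketch of such a sanitizer — the real repository may instead rely on `plainto_tsquery` or `websearch_to_tsquery`:

```typescript
// Convert a free-text search term into to_tsquery input:
// strip operator characters, prefix-match each word, AND them together.
// This is an illustrative helper, not the project's actual code.
export function toTsQueryInput(q: string): string {
  return q
    .trim()
    .split(/\s+/)
    .map((term) => term.replace(/[^\p{L}\p{N}]/gu, "")) // drop &, |, !, :, quotes
    .filter(Boolean)
    .map((term) => `${term}:*`) // prefix match each word
    .join(" & "); // require all words
}
```

For example, `toTsQueryInput("smith & sons plumbing")` yields `smith:* & sons:* & plumbing:*`, which is safe to pass to `to_tsquery('english', …)`.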
- `SEARCH_MAX_CANDIDATES` (env: `SEARCH_MAX_CANDIDATES`, default `5000`): caps the number of candidate rows used for the total count and pagination. You can still request the "next 100" results; the cap keeps count and data queries bounded for consistent response times.
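How the cap interacts with pagination, as a sketch (assuming the cap bounds the reported total rather than individual pages):

```typescript
// Illustrative pagination math for the candidate cap — not the actual
// service code. The cap bounds `total` (and therefore `totalPages`),
// while offsets within the cap still work normally.
const SEARCH_MAX_CANDIDATES = Number(process.env.SEARCH_MAX_CANDIDATES ?? 5000);

export function boundedPagination(candidateCount: number, page: number, limit: number) {
  const total = Math.min(candidateCount, SEARCH_MAX_CANDIDATES);
  return {
    total,
    totalPages: Math.ceil(total / limit),
    offset: (page - 1) * limit, // "next 100" still works within the cap
  };
}
```

With 9M matching rows and the default cap, the API reports at most 5,000 results, so the count query never has to scan the full candidate set.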
```
src/
├── core/                       # App bootstrap — config, logger, DI container, tokens
│   ├── config.ts               # Zod-validated environment config (single source of truth)
│   ├── logger.ts               # Pino structured logger
│   ├── types.ts                # DI injection tokens (Symbols)
│   └── container.ts            # tsyringe DI wiring
│
├── domain/                     # Pure business logic — no external dependencies
│   ├── entities/
│   │   ├── Business.ts         # Core entity (camelCase) + DB row (snake_case)
│   │   └── BusinessName.ts     # 1-to-many trading/business names
│   └── interfaces/
│       ├── IBusinessRepository.ts  # Data access contract
│       ├── ISearchStrategy.ts      # Search algorithm contract
│       ├── IDataSourceAdapter.ts   # Data normalisation contract
│       └── ITextToSqlEngine.ts     # AI search abstraction (future)
│
├── shared/                     # Cross-cutting types, errors, constants
│   ├── types.ts                # SearchQuery, PaginatedResult, IngestionResult
│   ├── errors/AppError.ts      # Operational vs programmer error hierarchy
│   └── constants.ts            # ABR reference codes, pagination defaults
│
├── infrastructure/             # External service implementations
│   ├── database/
│   │   ├── connection.ts       # Singleton connection pool
│   │   └── migrations/         # Versioned schema changes (001, 002, 003)
│   └── repositories/
│       └── PostgresBusinessRepository.ts  # searchNative (ILIKE), searchOptimized (tsvector @@), findWithFilters
│
├── application/                # Use cases and orchestration
│   ├── strategies/
│   │   ├── NativeSearchStrategy.ts     # ILIKE baseline
│   │   └── OptimizedSearchStrategy.ts  # tsvector + GIN
│   ├── factories/
│   │   └── SearchStrategyFactory.ts
│   └── services/
│       ├── SearchService.ts    # Search orchestrator
│       └── IngestionService.ts # ETL facade (spawns worker threads)
│
├── workers/                    # Background processing (separate V8 isolates)
│   └── etl/
│       ├── etlWorker.ts        # SAX streaming XML parser
│       ├── XmlDataSourceAdapter.ts  # ABR XML → Business entity transformer
│       └── batchProcessor.ts   # Chunked bulk upsert engine
│
├── interfaces/                 # HTTP layer
│   └── http/
│       ├── app.ts              # Express app factory
│       ├── middleware/         # Error handler, validation, request logger
│       ├── controllers/        # Thin request/response handlers
│       └── routes/             # URL → controller mapping
│
├── scripts/
│   └── seed.ts                 # CLI data ingestion with progress reporting
│
└── server.ts                   # Clustered entry point (1 worker per CPU core)
```
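The clustered entry point in `server.ts` boils down to forking one worker per CPU core. A sketch with the fork call injected so the scaling logic is visible in isolation — the real file presumably calls `node:cluster` directly:

```typescript
// Illustrative core of a clustered entry point — not the project's
// actual server.ts. The fork function is injected to keep it testable.
import * as os from "node:os";

export function startCluster(fork: () => void, cpuCount: number = os.cpus().length): number {
  // One worker per CPU core, as the capability table describes.
  for (let i = 0; i < cpuCount; i++) fork();
  return cpuCount;
}

// In a real entry point, roughly:
//   if (cluster.isPrimary) startCluster(() => cluster.fork());
//   else createApp().listen(config.port);
```

Because each forked worker is a separate process, each one also gets its own database connection pool — which is why the Singleton row above is scoped "per process", not per machine.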
- Node.js >= 24.x
- Docker & Docker Compose (for PostgreSQL)
```bash
# 1. Clone and install
git clone https://github.com/devnadeemashraf/vantage-backend.git
cd vantage-backend
npm install

# 2. Configure environment
cp .env.example .env
# Edit .env with your database credentials

# 3. Start PostgreSQL
docker compose up -d

# 4. Run migrations
npm run migrate

# 5. Seed the database (580 MB XML file in ./temp/data)
# NOTE: download the dataset manually and place it in ./temp/data
npm run seed -- --file ./temp/data/20260211_Public20.xml

# 6. Start the server
npm run dev
```

| Command | Description |
|---|---|
| `npm run dev` | Start the dev server with hot reload (tsx watch) |
| `npm run build` | Compile TypeScript + resolve path aliases |
| `npm start` | Run the production build (clustered) |
| `npm run migrate` | Run pending database migrations |
| `npm run migrate:rollback` | Roll back the last migration batch |
| `npm run seed` | Ingest XML data via CLI with progress output |
| `npm run seed:migrate` | Run migrations, then seed in one command |
| `npm run lint` | Run ESLint on `src/` |
| `npm run format` | Format all TypeScript files with Prettier |
```
GET /api/v1/businesses/search?q=plumbing&state=NSW&page=1&limit=20
```

Response includes `meta` with timing for UI display:

```json
{
  "status": "success",
  "data": [...],
  "pagination": { "page": 1, "limit": 20, "total": 123, "totalPages": 7 },
  "meta": {
    "totalTimeMs": 42,
    "queryTimeMs": 28
  }
}
```

- `totalTimeMs` — wall-clock time from request arrival to response sent (ms)
- `queryTimeMs` — time spent executing database queries (ms)
Query Parameters:
| Param | Type | Description |
|---|---|---|
| `q` | string | Search term (matched by ILIKE or full-text, depending on `technique`) |
| `state` | string | Filter by Australian state (NSW, VIC, QLD...) |
| `postcode` | string | Filter by postcode |
| `entityType` | string | Filter by entity type code (IND, PRV, PUB...) |
| `abnStatus` | string | Filter by ABN status (ACT, CAN) |
| `page` | number | Page number (default: 1) |
| `limit` | number | Results per page (default: 20, max: 100) |
| `mode` | string | Search mode: `standard` or `ai` (future) |
| `technique` | string | Search technique: `native` (ILIKE baseline) or `optimized` (index-backed when available). Default: `native`. Use to compare query performance. |
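The project validates these with Zod. As a dependency-free illustration of the same constraints (the defaults and the `limit` cap of 100), a sketch — the handling below is inferred from the table, not taken from the actual schema:

```typescript
// Illustrative query-param parsing mirroring the documented constraints.
// The real project does this with a Zod schema in the validation middleware.
export interface SearchParams {
  q?: string;
  state?: string;
  page: number;
  limit: number;
  technique: "native" | "optimized";
}

export function parseSearchParams(raw: Record<string, string | undefined>): SearchParams {
  const page = Math.max(1, Number(raw.page ?? 1) || 1);          // default 1, floor 1
  const limit = Math.min(100, Math.max(1, Number(raw.limit ?? 20) || 20)); // default 20, max 100
  const technique = raw.technique === "optimized" ? "optimized" : "native"; // default native
  return { q: raw.q, state: raw.state, page, limit, technique };
}
```

Clamping rather than rejecting out-of-range values is one possible policy; a strict Zod schema could instead return a 400 on `limit=500`.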
```
GET /api/v1/businesses/12345678901
```

Response includes `meta` with timing:

```json
{
  "status": "success",
  "data": { "abn": "12345678901", "entityName": "...", ... },
  "meta": { "totalTimeMs": 15, "queryTimeMs": 8 }
}
```

```
POST /api/v1/ingest
Content-Type: application/json

{ "filePath": "./temp/data/20260211_Public20.xml" }
```

```
GET /api/v1/health
```
The ingestion pipeline processes 580 MB XML files (~500K records) in approximately 90 seconds:
```
XML File (580 MB)
     │
     ▼
FileReadStream (64 KB chunks)
     │
     ▼
SAX Parser (event-driven, constant memory)
     │  opentag → closetag → text events
     ▼
XmlDataSourceAdapter (Adapter Pattern)
     │  Raw XML → Business domain entity
     ▼
BatchProcessor (5000 records/batch)
     │  Chunked INSERT ON CONFLICT MERGE
     │  (respects PG's 65,535 parameter limit)
     ▼
PostgreSQL (with auto-triggered search_vector update)
```
Key design decisions:
- Worker Thread isolation: XML parsing runs in a separate V8 isolate so the HTTP server stays responsive during ingestion.
- SAX streaming: Constant ~50 MB memory usage regardless of file size (vs ~3 GB for DOM parsing).
- Batch upserts: `INSERT ON CONFLICT MERGE` makes re-runs idempotent — the same file can be ingested twice without duplicates.
- Parameter chunking: batches are split into sub-batches of ~4,681 rows to stay under PostgreSQL's 65,535 bind parameter limit.
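The ~4,681-row figure follows from dividing PostgreSQL's 65,535 bind-parameter limit by the number of columns inserted per row. A sketch of the chunking arithmetic — the 14-column count is an assumption for illustration:

```typescript
// Split a batch into sub-batches small enough that (rows × columns)
// stays under PostgreSQL's bind-parameter limit. Illustrative helper,
// not the project's actual batchProcessor.
const PG_MAX_PARAMS = 65_535;

export function chunkForInsert<T>(rows: T[], columnsPerRow: number): T[][] {
  const chunkSize = Math.floor(PG_MAX_PARAMS / columnsPerRow); // 14 cols → 4,681 rows
  const chunks: T[][] = [];
  for (let i = 0; i < rows.length; i += chunkSize) {
    chunks.push(rows.slice(i, i + chunkSize));
  }
  return chunks;
}
```

So a 10,000-record batch with 14 columns per row becomes three sub-batches (4,681 + 4,681 + 638), each safely below the parameter limit.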
The architecture is designed to scale to 20+ ingested XML files (12+ GB total, 16M+ rows):
| Concern | Current | Scale Strategy |
|---|---|---|
| Search latency | GIN on `search_vector` (optimized technique) | Add pg_partman partitioning; optional pg_trgm for typo tolerance |
| Connection pressure | 10 connections/worker | PgBouncer connection pooler in front of PostgreSQL |
| Read throughput | Cluster module (N workers) | Read replicas + load balancer |
| Write throughput | Chunked upserts | COPY protocol for initial loads, upserts for incremental |
| AI search | Interface stub | Plug in SQLCoder or OpenAI wrapper via ITextToSqlEngine |
| Caching | None (fast enough for prototype) | Redis with cache-aside pattern on hot search queries |
| Monitoring | Pino JSON logs | Prometheus metrics + Grafana dashboards |