A production-grade backend that ingests the Australian Business Register (11.5 GB of XML, 20M+ businesses) and exposes a search API with full-text search (PostgreSQL tsvector + GIN), filtered queries, and an optional ILIKE baseline for performance comparison.
Built with Node.js 24, Express 5, TypeScript 5.9, and PostgreSQL 17.
| Capability | Implementation |
|---|---|
| ~580 MB XML ingested in ~90 seconds | SAX streaming parser in a dedicated Worker Thread — zero main-thread blocking |
| Sub-50 ms search (optimized path) | PostgreSQL tsvector + GIN index on `search_vector`; optional ILIKE baseline for comparison (~350 ms on 9M rows) |
| Multi-process HTTP | Node.js cluster module forks one worker per CPU core for near-linear throughput scaling |
| Layered architecture | Clear layer separation (Domain, Application, Infrastructure, Interfaces) with Dependency Injection |
| AI-ready | Abstraction layer for plugging in a Text-to-SQL engine (OpenAI, SQLCoder, etc.) |
```
┌─────────────────────────────────────────────────────────────────┐
│                          Entry Points                           │
│    server.ts (Cluster Primary)       seed.ts (CLI Ingestion)    │
├─────────────────────────────────────────────────────────────────┤
│                        Interfaces Layer                         │
│  Routes → Controllers → Middleware (Auth, Validation, Errors)   │
├─────────────────────────────────────────────────────────────────┤
│                        Application Layer                        │
│  SearchService ← SearchStrategyFactory → Native/Optimized       │
│  IngestionService (Facade) → Worker Thread                      │
├─────────────────────────────────────────────────────────────────┤
│                          Domain Layer                           │
│  Business Entity │ IBusinessRepository │ ISearchStrategy        │
│  IDataSourceAdapter │ ITextToSqlEngine                          │
├─────────────────────────────────────────────────────────────────┤
│                      Infrastructure Layer                       │
│  PostgresBusinessRepository │ searchNative / searchOptimized    │
│  Knex Migrations │ businesses, business_names, trigger + GIN    │
├─────────────────────────────────────────────────────────────────┤
│                          Workers Layer                          │
│  ETL Worker Thread → SAX Parser → XmlAdapter → BatchProcessor   │
└─────────────────────────────────────────────────────────────────┘
```
| Pattern | Where | Why |
|---|---|---|
| Repository | `IBusinessRepository` / `PostgresBusinessRepository` | Decouple domain logic from the database engine — swap Postgres for Elasticsearch without touching services |
| Strategy | `ISearchStrategy` / `NativeSearchStrategy` / `OptimizedSearchStrategy` | Swap the search technique per request: native (ILIKE) vs optimized (tsvector + GIN); AI via `mode` (future) |
| Factory | `SearchStrategyFactory` | Picks a strategy from `query.technique` and `query.mode` — one new class + one case per technique |
| Adapter | `XmlDataSourceAdapter` | Normalize raw ABR XML into domain entities — add a JSON adapter without changing the ETL pipeline |
| Facade | `IngestionService` | Hide the complexity of worker threads, SAX parsing, and batch processing behind `ingest(filePath)` |
| Singleton | `getDbConnection()` | One connection pool per process — in a clustered setup, each worker gets its own pool |
| Dependency Injection | tsyringe + Symbol tokens | Wire everything in one place (`container.ts`) — swap implementations by changing one line |
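The Strategy and Factory rows above can be sketched together. The class and interface names match the table, but the method signatures below are illustrative assumptions, not the project's actual contracts:

```typescript
// Sketch of the Strategy + Factory pairing — signatures are assumed.
interface SearchResult {
  rows: string[];
  technique: string;
}

interface ISearchStrategy {
  search(term: string): Promise<SearchResult>;
}

class NativeSearchStrategy implements ISearchStrategy {
  async search(term: string): Promise<SearchResult> {
    // Would run: entity_name ILIKE '%term%'
    return { rows: [], technique: "native" };
  }
}

class OptimizedSearchStrategy implements ISearchStrategy {
  async search(term: string): Promise<SearchResult> {
    // Would run: search_vector @@ to_tsquery('english', term)
    return { rows: [], technique: "optimized" };
  }
}

class SearchStrategyFactory {
  // One new class + one case per technique, as the table notes.
  create(technique: string): ISearchStrategy {
    switch (technique) {
      case "optimized":
        return new OptimizedSearchStrategy();
      case "native":
      default:
        return new NativeSearchStrategy();
    }
  }
}
```

Adding an AI-backed technique would mean one new class implementing `ISearchStrategy` and one new `case` — no changes to `SearchService`.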
| Category | Technology | Version |
|---|---|---|
| Runtime | Node.js | 24.x LTS |
| Framework | Express | 5.2.1 |
| Language | TypeScript | 5.9.3 (strict mode) |
| Database | PostgreSQL | 17 (via Docker) |
| Query Builder | Knex.js | 3.x |
| Validation | Zod | 4.x |
| Logging | Pino | 10.x (JSON in prod, pretty in dev) |
| XML Parsing | SAX | 1.4.x (streaming, constant memory) |
| DI Container | tsyringe | 4.x |
| Linting | ESLint 10 + typescript-eslint | Flat config |
| Formatting | Prettier | 3.x |
- `businesses`: one row per ABN; columns include `entity_name`, `state`, `postcode`, and a `search_vector` (TSVECTOR) column. B-tree indexes on `abn_status`, `entity_type_code`, `state`, and `postcode` speed up filters.
- `business_names`: 1-to-many alternate names per business; FK to `businesses`, index on `business_id`.
- Migration 003: a trigger maintains `search_vector` on INSERT/UPDATE (weights: entity_name A, given/family name B, state/postcode C), a one-time backfill covers existing rows, and a GIN index on `businesses(search_vector)` lets full-text matches use the index.
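Migration 003 plausibly looks like the following Knex-style migration. The weight assignments come from the description above, but the exact SQL and column names are assumptions (a minimal `Raw` type stands in for Knex so the sketch is self-contained):

```typescript
// Hypothetical reconstruction of Migration 003 — SQL and column names
// are assumptions based on the schema description, not the real file.
type Raw = { raw: (sql: string) => Promise<unknown> }; // minimal Knex stand-in

const MIGRATION_003_SQL = `
  CREATE OR REPLACE FUNCTION businesses_search_vector_update() RETURNS trigger AS $$
  BEGIN
    NEW.search_vector :=
      setweight(to_tsvector('english', coalesce(NEW.entity_name, '')), 'A') ||
      setweight(to_tsvector('english',
        coalesce(NEW.given_name, '') || ' ' || coalesce(NEW.family_name, '')), 'B') ||
      setweight(to_tsvector('english',
        coalesce(NEW.state, '') || ' ' || coalesce(NEW.postcode, '')), 'C');
    RETURN NEW;
  END $$ LANGUAGE plpgsql;

  CREATE TRIGGER trg_businesses_search_vector
    BEFORE INSERT OR UPDATE ON businesses
    FOR EACH ROW EXECUTE FUNCTION businesses_search_vector_update();

  -- One-time backfill: a no-op UPDATE fires the trigger for existing rows.
  UPDATE businesses SET entity_name = entity_name;

  CREATE INDEX idx_businesses_search_vector ON businesses USING GIN (search_vector);
`;

export async function up(db: Raw): Promise<void> {
  await db.raw(MIGRATION_003_SQL);
}
```

Keeping the vector maintenance in a trigger means the ETL worker never has to compute tsvectors itself — any INSERT path gets indexed text for free.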
| Technique | Query param | How it works | Latency (e.g. 9M rows) |
|---|---|---|---|
| native | `technique=native` (default) | `entity_name ILIKE '%term%'`; no index, sequential scan | ~350 ms |
| optimized | `technique=optimized` | `search_vector @@ to_tsquery('english', …)` using the GIN index | Sub-50 ms for selective terms |
Both use the same filters (`state`, `postcode`, `entityType`, `abnStatus`) and pagination. Filter-only requests (no `q`) use `findWithFilters()` for both techniques.
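One subtlety of the optimized path: `to_tsquery` rejects raw user input containing spaces or operators, so `q` must be converted into tsquery syntax first. A hedged sketch of such a sanitizer — the real repository may instead rely on `plainto_tsquery` or `websearch_to_tsquery`:

```typescript
// Convert a free-text search term into to_tsquery input:
// strip operator characters, prefix-match each word, AND them together.
// This is an illustrative helper, not the project's actual code.
export function toTsQueryInput(q: string): string {
  return q
    .trim()
    .split(/\s+/)
    .map((term) => term.replace(/[^\p{L}\p{N}]/gu, "")) // drop &, |, !, :, quotes
    .filter(Boolean)
    .map((term) => `${term}:*`) // prefix match each word
    .join(" & "); // require all words
}
```

For example, `toTsQueryInput("smith & sons plumbing")` yields `smith:* & sons:* & plumbing:*`, which is safe to pass to `to_tsquery('english', …)`.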
- `SEARCH_MAX_CANDIDATES` (env: `SEARCH_MAX_CANDIDATES`, default `5000`): caps the number of candidate rows used for the total count and pagination. You can still request the "next 100" results; the cap keeps count and data queries bounded for consistent response times.
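How the cap interacts with pagination, as a sketch (assuming the cap bounds the reported total rather than individual pages):

```typescript
// Illustrative pagination math for the candidate cap — not the actual
// service code. The cap bounds `total` (and therefore `totalPages`),
// while offsets within the cap still work normally.
const SEARCH_MAX_CANDIDATES = Number(process.env.SEARCH_MAX_CANDIDATES ?? 5000);

export function boundedPagination(candidateCount: number, page: number, limit: number) {
  const total = Math.min(candidateCount, SEARCH_MAX_CANDIDATES);
  return {
    total,
    totalPages: Math.ceil(total / limit),
    offset: (page - 1) * limit, // "next 100" still works within the cap
  };
}
```

With 9M matching rows and the default cap, the API reports at most 5,000 results, so the count query never has to scan the full candidate set.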
```
src/
├── core/                       # App bootstrap — config, logger, DI container, tokens
│   ├── config.ts               # Zod-validated environment config (single source of truth)
│   ├── logger.ts               # Pino structured logger
│   ├── types.ts                # DI injection tokens (Symbols)
│   └── container.ts            # tsyringe DI wiring
│
├── domain/                     # Pure business logic — no external dependencies
│   ├── entities/
│   │   ├── Business.ts         # Core entity (camelCase) + DB row (snake_case)
│   │   └── BusinessName.ts     # 1-to-many trading/business names
│   └── interfaces/
│       ├── IBusinessRepository.ts  # Data access contract
│       ├── ISearchStrategy.ts      # Search algorithm contract
│       ├── IDataSourceAdapter.ts   # Data normalisation contract
│       └── ITextToSqlEngine.ts     # AI search abstraction (future)
│
├── shared/                     # Cross-cutting types, errors, constants
│   ├── types.ts                # SearchQuery, PaginatedResult, IngestionResult
│   ├── errors/AppError.ts      # Operational vs programmer error hierarchy
│   └── constants.ts            # ABR reference codes, pagination defaults
│
├── infrastructure/             # External service implementations
│   ├── database/
│   │   ├── connection.ts       # Singleton connection pool
│   │   └── migrations/         # Versioned schema changes (001, 002, 003)
│   └── repositories/
│       └── PostgresBusinessRepository.ts  # searchNative (ILIKE), searchOptimized (tsvector @@), findWithFilters
│
├── application/                # Use cases and orchestration
│   ├── strategies/
│   │   ├── NativeSearchStrategy.ts     # ILIKE baseline
│   │   └── OptimizedSearchStrategy.ts  # tsvector + GIN
│   ├── factories/
│   │   └── SearchStrategyFactory.ts
│   └── services/
│       ├── SearchService.ts    # Search orchestrator
│       └── IngestionService.ts # ETL facade (spawns worker threads)
│
├── workers/                    # Background processing (separate V8 isolates)
│   └── etl/
│       ├── etlWorker.ts        # SAX streaming XML parser
│       ├── XmlDataSourceAdapter.ts  # ABR XML → Business entity transformer
│       └── batchProcessor.ts   # Chunked bulk upsert engine
│
├── interfaces/                 # HTTP layer
│   └── http/
│       ├── app.ts              # Express app factory
│       ├── middleware/         # Error handler, validation, request logger
│       ├── controllers/        # Thin request/response handlers
│       └── routes/             # URL → controller mapping
│
├── scripts/
│   └── seed.ts                 # CLI data ingestion with progress reporting
│
└── server.ts                   # Clustered entry point (1 worker per CPU core)
```
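The clustered entry point in `server.ts` boils down to forking one worker per CPU core. A sketch with the fork call injected so the scaling logic is visible in isolation — the real file presumably calls `node:cluster` directly:

```typescript
// Illustrative core of a clustered entry point — not the project's
// actual server.ts. The fork function is injected to keep it testable.
import * as os from "node:os";

export function startCluster(fork: () => void, cpuCount: number = os.cpus().length): number {
  // One worker per CPU core, as the capability table describes.
  for (let i = 0; i < cpuCount; i++) fork();
  return cpuCount;
}

// In a real entry point, roughly:
//   if (cluster.isPrimary) startCluster(() => cluster.fork());
//   else createApp().listen(config.port);
```

Because each forked worker is a separate process, each one also gets its own database connection pool — which is why the Singleton row above is scoped "per process", not per machine.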
- Node.js >= 24.x
- Docker & Docker Compose (for PostgreSQL)
```bash
# 1. Clone and install
git clone https://github.com/devnadeemashraf/vantage-backend.git
cd vantage-backend
npm install

# 2. Configure environment
cp .env.example .env
# Edit .env with your database credentials

# 3. Start PostgreSQL
docker compose up -d

# 4. Run migrations
npm run migrate

# 5. Seed the database (580 MB XML file in ./temp/data)
# NOTE: download the dataset manually and place it in ./temp/data
npm run seed -- --file ./temp/data/20260211_Public20.xml

# 6. Start the server
npm run dev
```

| Command | Description |
|---|---|
| `npm run dev` | Start the dev server with hot reload (tsx watch) |
| `npm run build` | Compile TypeScript + resolve path aliases |
| `npm start` | Run the production build (clustered) |
| `npm run migrate` | Run pending database migrations |
| `npm run migrate:rollback` | Roll back the last migration batch |
| `npm run seed` | Ingest XML data via CLI with progress output |
| `npm run seed:migrate` | Run migrations, then seed in one command |
| `npm run lint` | Run ESLint on `src/` |
| `npm run format` | Format all TypeScript files with Prettier |
```
GET /api/v1/businesses/search?q=plumbing&state=NSW&page=1&limit=20
```

Response includes `meta` with timing for UI display:

```json
{
  "status": "success",
  "data": [...],
  "pagination": { "page": 1, "limit": 20, "total": 123, "totalPages": 7 },
  "meta": {
    "totalTimeMs": 42,
    "queryTimeMs": 28
  }
}
```

- `totalTimeMs` — wall-clock time from request arrival to response sent (ms)
- `queryTimeMs` — time spent executing database queries (ms)
Query Parameters:
| Param | Type | Description |
|---|---|---|
| `q` | string | Search term (matched by ILIKE or full-text, depending on `technique`) |
| `state` | string | Filter by Australian state (NSW, VIC, QLD...) |
| `postcode` | string | Filter by postcode |
| `entityType` | string | Filter by entity type code (IND, PRV, PUB...) |
| `abnStatus` | string | Filter by ABN status (ACT, CAN) |
| `page` | number | Page number (default: 1) |
| `limit` | number | Results per page (default: 20, max: 100) |
| `mode` | string | Search mode: `standard` or `ai` (future) |
| `technique` | string | Search technique: `native` (ILIKE baseline) or `optimized` (index-backed when available). Default: `native`. Use to compare query performance. |
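The project validates these with Zod. As a dependency-free illustration of the same constraints (the defaults and the `limit` cap of 100), a sketch — the handling below is inferred from the table, not taken from the actual schema:

```typescript
// Illustrative query-param parsing mirroring the documented constraints.
// The real project does this with a Zod schema in the validation middleware.
export interface SearchParams {
  q?: string;
  state?: string;
  page: number;
  limit: number;
  technique: "native" | "optimized";
}

export function parseSearchParams(raw: Record<string, string | undefined>): SearchParams {
  const page = Math.max(1, Number(raw.page ?? 1) || 1);          // default 1, floor 1
  const limit = Math.min(100, Math.max(1, Number(raw.limit ?? 20) || 20)); // default 20, max 100
  const technique = raw.technique === "optimized" ? "optimized" : "native"; // default native
  return { q: raw.q, state: raw.state, page, limit, technique };
}
```

Clamping rather than rejecting out-of-range values is one possible policy; a strict Zod schema could instead return a 400 on `limit=500`.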
```
GET /api/v1/businesses/12345678901
```

Response includes `meta` with timing:

```json
{
  "status": "success",
  "data": { "abn": "12345678901", "entityName": "...", ... },
  "meta": { "totalTimeMs": 15, "queryTimeMs": 8 }
}
```

```
POST /api/v1/ingest
Content-Type: application/json

{ "filePath": "./temp/data/20260211_Public20.xml" }
```

```
GET /api/v1/health
```
The ingestion pipeline processes 580 MB XML files (~500K records) in approximately 90 seconds:
```
XML File (580 MB)
     │
     ▼
FileReadStream (64 KB chunks)
     │
     ▼
SAX Parser (event-driven, constant memory)
     │  opentag → closetag → text events
     ▼
XmlDataSourceAdapter (Adapter Pattern)
     │  Raw XML → Business domain entity
     ▼
BatchProcessor (5000 records/batch)
     │  Chunked INSERT ON CONFLICT MERGE
     │  (respects PG's 65,535 parameter limit)
     ▼
PostgreSQL (with auto-triggered search_vector update)
```
Key design decisions:
- Worker Thread isolation: XML parsing runs in a separate V8 isolate so the HTTP server stays responsive during ingestion.
- SAX streaming: Constant ~50 MB memory usage regardless of file size (vs ~3 GB for DOM parsing).
- Batch upserts: `INSERT ON CONFLICT MERGE` makes re-runs idempotent — the same file can be ingested twice without duplicates.
- Parameter chunking: batches are split into sub-batches of ~4,681 rows to stay under PostgreSQL's 65,535 bind parameter limit.
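The ~4,681-row figure follows from dividing PostgreSQL's 65,535 bind-parameter limit by the number of columns inserted per row. A sketch of the chunking arithmetic — the 14-column count is an assumption for illustration:

```typescript
// Split a batch into sub-batches small enough that (rows × columns)
// stays under PostgreSQL's bind-parameter limit. Illustrative helper,
// not the project's actual batchProcessor.
const PG_MAX_PARAMS = 65_535;

export function chunkForInsert<T>(rows: T[], columnsPerRow: number): T[][] {
  const chunkSize = Math.floor(PG_MAX_PARAMS / columnsPerRow); // 14 cols → 4,681 rows
  const chunks: T[][] = [];
  for (let i = 0; i < rows.length; i += chunkSize) {
    chunks.push(rows.slice(i, i + chunkSize));
  }
  return chunks;
}
```

So a 10,000-record batch with 14 columns per row becomes three sub-batches (4,681 + 4,681 + 638), each safely below the parameter limit.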
The architecture is designed to scale to 20+ ingested XML files (12+ GB total, 16M+ rows):
| Concern | Current | Scale Strategy |
|---|---|---|
| Search latency | GIN on `search_vector` (optimized technique) | Add pg_partman partitioning; optional pg_trgm for typo tolerance |
| Connection pressure | 10 connections/worker | PgBouncer connection pooler in front of PostgreSQL |
| Read throughput | Cluster module (N workers) | Read replicas + load balancer |
| Write throughput | Chunked upserts | COPY protocol for initial loads, upserts for incremental |
| AI search | Interface stub | Plug in SQLCoder or OpenAI wrapper via ITextToSqlEngine |
| Caching | None (fast enough for prototype) | Redis with cache-aside pattern on hot search queries |
| Monitoring | Pino JSON logs | Prometheus metrics + Grafana dashboards |