A complete local microservices learning platform that simulates a Groupon-like "Deal Experience Platform" and focuses on production Node.js concerns:
- Concurrency control
- Backpressure
- Observability (logs, metrics, tracing)
- SLO/SLI/SLA
- Event loop blocking vs worker threads
- Resilience patterns (timeouts, retries, degradation)
- Graceful shutdown
- Queue processing at scale
All application code is JavaScript for Node 20+.
- `experience-api` (localhost:3000): aggregator and orchestration layer
- `deal-service` (localhost:3001): base catalog data
- `price-service` (localhost:3002): pricing and discounting
- `inventory-service` (localhost:3003): availability
- `rating-lb` (localhost:3014): load balancer for rating replicas
- `rating-service-a` (localhost:3004): rating replica A
- `rating-service-b` (localhost:3006): rating replica B
- `merchant-service` (localhost:3005): merchant metadata
- `slo-dashboard` (localhost:3100): parses Prometheus text metrics and evaluates SLOs
- `enrichment-worker`: high-volume enrichment jobs with retries and DLQ
- `critical-worker`: purchase workflow jobs (fraud check, analytics, email) with retries and DLQ
- `redis` (localhost:6379): BullMQ backend + read-model/cache store for hybrid list reads
- `redis-commander` (localhost:8081): Redis queue inspection UI
- `GET /api/deals?city=valencia&limit=20`
- `GET /api/deals/:id`
- `POST /api/purchase`
- `POST /api/enrichment-jobs`
- `POST /api/heavy-hash/bad`
- `POST /api/heavy-hash/good`
- `GET /healthz`
- `GET /readyz`
- `GET /metrics`
- Node.js 20+
- Docker + Docker Compose
```bash
npm install
npm run dev
```

This launches all services, workers, and the dashboard together.
```bash
docker compose up --build
npm run load:deals
npm run load:enqueue
npm run load:mixed
node tools/load.js --target=http://localhost:3000 --duration=30 --concurrency=50 --mode=mixed
```

Modes:

- `deals`
- `enqueue`
- `hash-bad`
- `hash-good`
- `mixed`
CLI output includes:
- Requests sent
- Success/failures
- RPS
- Avg latency
- p95 latency estimate
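The p95 latency estimate can be derived from the recorded samples with a nearest-rank quantile. A minimal sketch of that computation (the `quantile` helper is illustrative, not the repo's exact code):

```javascript
// Nearest-rank quantile estimate over collected latency samples (ms).
function quantile(samples, q) {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil(q * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

const latencies = [12, 15, 11, 240, 18, 14, 300, 13, 16, 17];
console.log(quantile(latencies, 0.95)); // → 300
```

Nearest-rank is a coarse but allocation-cheap estimator, which is what a load-generator CLI typically wants.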
The dashboard polls every 2 seconds and drives four live views:
- `/api/sli`: SLI/SLO cards and trend charts
- `/api/topology`: service graph, edge stats, ingress rates
- `/api/scenarios`: scenario activation state + explanations
- `/api/hybrid`: hybrid read-model effectiveness and fan-out reduction
It shows:
- p95 latency
- error rate
- queue depth
- queue lag estimate
- upstream failure rate
- request rate
It also evaluates SLOs:
- p95 latency < 300ms
- error rate < 1%
- queue depth < 1000
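The evaluation itself is a straight threshold check against the current SLI values. A sketch, assuming an `slis` object shaped like the dashboard payload (the object shape is illustrative):

```javascript
// Evaluate current SLI readings against the SLO thresholds above.
const SLOS = [
  { name: 'p95 latency', key: 'p95LatencyMs', max: 300 },
  { name: 'error rate', key: 'errorRatePct', max: 1 },
  { name: 'queue depth', key: 'queueDepth', max: 1000 },
];

function evaluateSlos(slis) {
  return SLOS.map(({ name, key, max }) => ({
    name,
    value: slis[key],
    ok: slis[key] < max, // true while the objective is being met
  }));
}

console.log(evaluateSlos({ p95LatencyMs: 210, errorRatePct: 2.4, queueDepth: 40 }));
```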
Scenario buttons trigger live experiments:
- Normal traffic (default)
- Spike traffic
- Chaotic scenario
- High throughput (max load)
- Enqueue 10k jobs
- Trigger blocking hash
- Trigger worker-thread hash
- `requestRateRps = Δ(experience_http_request_duration_seconds_count) / Δt`
- `errorRatePct = (Δ(experience_http_request_errors_total) / Δ(request_count)) * 100`
- `p95LatencyMs = histogramQuantile(0.95, experience_http_request_duration_seconds_bucket) * 1000`
- `upstreamFailureRatePct = (Δ(experience_upstream_failures_total) / Δ(experience_upstream_requests_total)) * 100`
- `queueDepth = enrichment_depth + critical_depth`
- `queueLagEstimateSec = queueDepth / queueCompletionRatePerSecond` (or `Infinity` if depth > 0 and completion rate is 0)
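Since the dashboard works from counter deltas between scrapes, the formulas above reduce to simple arithmetic over two snapshots. A sketch, with illustrative field names for the parsed counters:

```javascript
// Derive dashboard SLIs from two successive Prometheus scrapes taken
// `dtSeconds` apart. `prev` and `curr` are parsed counter snapshots.
function computeSlis(prev, curr, dtSeconds) {
  const dReq = curr.requestCount - prev.requestCount;
  const dErr = curr.errorCount - prev.errorCount;
  const queueDepth = curr.enrichmentDepth + curr.criticalDepth;
  const completionRate = (curr.completed - prev.completed) / dtSeconds;
  return {
    requestRateRps: dReq / dtSeconds,
    errorRatePct: dReq > 0 ? (dErr / dReq) * 100 : 0,
    queueDepth,
    queueLagEstimateSec:
      completionRate > 0 ? queueDepth / completionRate
      : queueDepth > 0 ? Infinity
      : 0,
  };
}
```

The `Infinity` branch matches the formula above: a non-empty queue with zero completions has an unbounded lag estimate.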
- Left-to-right graph: clients -> experience-api -> downstream services/queues/workers
- Edge labels include live RPS, p95, failure rate, and queue depth/lag, depending on edge type
- Color modes:
- Auto severity colors
- Manual palette override with persistence
- You can drag boxes to customize layout; positions persist in browser storage
Redis has two roles:
- Queue state backend for BullMQ (`enrichment-jobs`, `critical-jobs`)
- Read-model/cache for hybrid list responses in `experience-api`
Graph details:
- `experience-api -> redis`: cache/read-model usage
- `redis -> enrichment-queue` and `redis -> critical-queue`: queue backend relationship
- Redis node now shows cache metrics:
  - read-model hits/misses and hit %
  - base catalog cache hits/misses and hit %
When high-throughput sends many GET /api/deals requests, most are served from Redis read-model/cache. That keeps downstream fan-out traffic low even if ingress RPS is very high.
To force downstream fan-out, use scenarios that generate more /api/deals/:id traffic (detail path) or combined mixed load.
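The hybrid list path is a cache-aside read: check the Redis read model first, and only fan out to downstream services on a miss. A minimal sketch, where `cache` stands in for the Redis client and `fetchFromServices` for the fan-out path (both names are illustrative):

```javascript
// Cache-aside read for GET /api/deals: serve from the read model when
// possible; fall back to downstream fan-out and repopulate on a miss.
async function listDeals(cache, city, fetchFromServices) {
  const key = `deals:list:${city}`;
  const hit = await cache.get(key);
  if (hit) return JSON.parse(hit); // no downstream traffic at all

  const deals = await fetchFromServices(city); // full fan-out path
  await cache.set(key, JSON.stringify(deals), { ttlSeconds: 30 });
  return deals;
}
```

Under a read-heavy scenario, most requests take the first branch, which is why ingress RPS can climb without downstream RPS following it.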
- Normal traffic (steady baseline): watch for a stable low error rate, moderate p95, small queue depth
- Spike traffic (short read bursts): watch for p95 jumps, upstream retries/timeouts, temporary degradations
- Chaotic scenario (mixed unpredictable pressure): watch for simultaneous movement in latency, errors, queue depth, and worker lag
- High throughput (sustained ingress stress at max load): watch for high client ingress and where bottlenecks appear
- Enqueue 10k jobs (queue pressure test): watch for depth growth, lag growth, and eventual 429 admission control
- Trigger blocking hash (event-loop blocking): watch for API latency spikes and degraded responsiveness
- Trigger worker-thread hash (CPU offload control test): compare with the blocking hash to see lower event-loop impact
| Component | Purpose | What to observe |
|---|---|---|
| `experience-api` | Aggregates data, applies retries/timeouts/degradation, enqueues jobs | Ingress RPS, p95, error rate, warning/degraded responses |
| `deal-service` | Source of base deal catalog | Upstream request rate and latency under list/detail load |
| `price-service` | Computes discounted pricing | Added latency in fan-out and failure contribution to degraded responses |
| `inventory-service` | Availability checks | Behavior during spikes and impact on purchase/detail quality |
| `rating-lb` | Distributes rating traffic between replicas | Per-target distribution and target health/error changes |
| `rating-service-a` / `rating-service-b` | Rating replicas with independent health and errors | Load-balancing spread, per-target failure rate, health flips |
| `merchant-service` | Merchant metadata enrichment | Upstream call rate during detail fan-out |
| `enrichment-worker` | Processes enrichment queue with retries/DLQ semantics | Completion rate, queue lag, failure rate |
| `critical-worker` | Purchase workflow pipeline (fraud/analytics/email) | Critical queue lag, processing stability under load |
| `redis` | Queue backend + read-model/cache store | Queue depth, cache hit rates, readiness |
| `slo-dashboard` | Learning UI for metrics + scenarios + topology | Correlated behavior across latency/errors/queues/fan-out |
`experience-api` uses `p-limit` to cap concurrent upstream fan-out calls and avoid an unbounded `Promise.all` burst.
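The effect of that cap can be shown without the dependency: every upstream call is funneled through a shared concurrency gate. A dependency-free sketch of the behavior `p-limit` provides (`createLimit` and the limit of 4 are illustrative):

```javascript
// Minimal p-limit-style gate: at most `max` tasks run concurrently;
// the rest queue and start as slots free up.
function createLimit(max) {
  let active = 0;
  const waiting = [];
  const next = () => {
    if (active >= max || waiting.length === 0) return;
    active++;
    const { fn, resolve, reject } = waiting.shift();
    fn().then(resolve, reject).finally(() => { active--; next(); });
  };
  return (fn) => new Promise((resolve, reject) => {
    waiting.push({ fn, resolve, reject });
    next();
  });
}

// Cap fan-out at 4 concurrent upstream calls instead of a raw Promise.all.
const limit = createLimit(4);
const fetchAll = (ids, fetchOne) =>
  Promise.all(ids.map((id) => limit(() => fetchOne(id))));
```

The caller still gets one promise per item in order, but the downstream services never see more than 4 requests in flight from this path.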
All upstream GET requests use undici with explicit timeout (800ms default) and one retry with jitter.
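One way to sketch that policy is a single retry with jittered delay wrapped around a request function, using `AbortSignal.timeout` for the 800ms budget. Here `doRequest` stands in for the undici call and is illustrative:

```javascript
// One attempt + one retry with jittered backoff; each attempt gets its
// own timeout via AbortSignal.timeout (available since Node 17.3).
async function getWithRetry(doRequest, url, { timeoutMs = 800, baseDelayMs = 100 } = {}) {
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      return await doRequest(url, { signal: AbortSignal.timeout(timeoutMs) });
    } catch (err) {
      if (attempt === 1) throw err; // retry budget exhausted
      const jitter = Math.random() * baseDelayMs; // spread out retry storms
      await new Promise((r) => setTimeout(r, baseDelayMs + jitter));
    }
  }
}
```

In the services themselves, `doRequest` would be the actual undici call; the jitter matters because synchronized retries from many callers can amplify an upstream blip into a thundering herd.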
If non-critical upstreams fail, responses still succeed with warnings:
```json
{
  "deals": [],
  "warnings": ["rating-lb timeout"],
  "meta": { "requestId": "..." }
}
```

`POST /api/enrichment-jobs` applies admission control: if queue depth crosses the threshold, the API returns `429`.
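The check itself is a simple depth threshold taken before enqueueing. A sketch of the decision step (the threshold value and response shape are illustrative):

```javascript
// Admission control for POST /api/enrichment-jobs: shed load with a 429
// before the queue grows unboundedly, instead of accepting work that
// will only rot in the backlog.
const MAX_QUEUE_DEPTH = 1000;

function admissionDecision(queueDepth, max = MAX_QUEUE_DEPTH) {
  if (queueDepth >= max) {
    return {
      status: 429,
      body: { error: 'queue saturated', retryAfterSeconds: 5 },
    };
  }
  return { status: 202, body: { accepted: true } };
}
```

A handler would read the current depth from the queue backend first, then respond with `decision.status` and only enqueue on the 202 path.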
- `POST /api/heavy-hash/bad` runs CPU work on the main thread
- `POST /api/heavy-hash/good` offloads the same work to `worker_threads`
- Structured logs via `pino` + `pino-http`
- Correlation IDs (`x-correlation-id`) across services
- Prometheus metrics via `prom-client`
- OpenTelemetry SDK Node + auto instrumentation
- `/healthz` for liveness
- `/readyz` for readiness (`redis` + upstream health checks)
- Graceful shutdown on `SIGINT`/`SIGTERM`
This project includes package-lock.json and uses it to keep installs deterministic over time.
Why lockfiles matter:
- They pin transitive versions for reproducible installs
- They reduce "works on my machine" drift
- They improve incident debugging by preserving dependency history
Recommended install in CI/CD and containers:
```bash
npm ci
```

- Observe degraded responses when one upstream fails or times out
- Watch request latency and error counters in `/metrics`
- Force queue overload and verify `429` admission control
- Compare `/api/heavy-hash/bad` vs `/api/heavy-hash/good` under load
- Trigger purchase jobs and inspect retries + DLQ behavior
- Use Redis Commander to inspect queue depth and stuck jobs
- Validate `/readyz` behavior when a dependency is down
- Stop services with `Ctrl+C` and confirm graceful shutdown logs
```mermaid
flowchart LR
    client[Client] --> experience[Experience API]
    experience --> redis[(Redis read model/cache)]
    experience --> deal[deal-service]
    experience --> price[price-service]
    experience --> inventory[inventory-service]
    experience --> ratinglb[rating-lb]
    ratinglb --> ratinga[rating-service-a]
    ratinglb --> ratingb[rating-service-b]
    experience --> merchant[merchant-service]
    experience --> response[Aggregated Deal Response with Warnings]
```
```mermaid
flowchart LR
    client[Client] --> enqueue[POST /api/enrichment-jobs]
    enqueue --> depth{Queue depth > threshold?}
    depth -->|Yes| reject[429 Backpressure]
    depth -->|No| queue[Redis Queue]
    queue --> worker[enrichment-worker]
    worker --> done[Completed]
    worker --> retries[Retries with exponential backoff]
    retries --> dlq[DLQ after attempts exhausted]
```
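With BullMQ-style exponential backoff (`backoff: { type: 'exponential', delay }`), the wait before each retry doubles. A sketch of the schedule and the DLQ routing decision; the 5-attempt budget and function names are illustrative:

```javascript
// Delay before retry `attempt` (1-based) under exponential backoff:
// baseDelay * 2^(attempt - 1), e.g. 1s, 2s, 4s, 8s for a 1000ms base.
function backoffDelayMs(attempt, baseDelayMs = 1000) {
  return baseDelayMs * 2 ** (attempt - 1);
}

// Once the attempt budget is exhausted, the job moves to the DLQ
// instead of being retried again.
function routeFailedJob(attemptsMade, maxAttempts = 5) {
  return attemptsMade >= maxAttempts
    ? { action: 'dead-letter' }
    : { action: 'retry', delayMs: backoffDelayMs(attemptsMade + 1) };
}
```

Doubling delays keeps transient failures cheap (fast first retry) while making a persistently failing job back off quickly enough not to hog worker capacity before it dead-letters.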
```mermaid
flowchart LR
    purchase[POST /api/purchase] --> queue[critical-jobs queue]
    queue --> fraud[fraud-check simulation]
    fraud --> analytics[analytics event simulation]
    analytics --> email[email receipt simulation]
    email --> success[Purchase workflow complete]
    fraud --> retry[Retry/backoff]
    analytics --> retry
    email --> retry
    retry --> criticalDlq[critical-dlq]
```
```mermaid
flowchart LR
    req[API request] --> bad[heavy-hash/bad main thread]
    req --> good[heavy-hash/good worker thread]
    bad --> block[Event loop blocked]
    good --> worker[Worker thread CPU execution]
    worker --> responsive[Event loop remains responsive]
```
MIT. See LICENSE.
