opendsr-std/seedfaker


seedfaker

Deterministic synthetic data generator. Same seed, same output — across CLI, Python, Node.js, Go, PHP, Ruby, WASM.

200+ fields, 68 locales, multi-table FK, expressions, templates, streaming, replace for anonymising existing data.

Highlights

  • Deterministic across 7 runtimes: CLI, Python, Node, Go, PHP, Ruby, WASM. Same seed → byte-identical output. --fingerprint catches algorithm drift.
  • Multi-table FK: anchors (users.id:zipf), dereference (customer_id->email), self-reference, ctx:strict identity correlation.
  • Distributed: --shard I/N on three hosts, concatenate, bit-identical to single-host. No coordinator.
  • Database ingest: seedfaker | psql "\COPY", no files, constant memory.
  • TB-scale: 1 GB into Postgres in 9 s on 8-core (benchmark); 1 TB ≈ 4.3 h.
  • Throughput: ~90 MB/s per core (TPC-H dbgen parity), 403 MB/s on 8 threads. Reproducible in benchmarks/.
  • In-place anonymisation: seedfaker replace email ssn < dump.csv. Same value + seed = same replacement; cross-file joins survive.
  • ML/LLM datasets: --annotated (byte-offset spans), --corrupt (15 noise types), templates (prompt/completion), multi-table FK (conversations, RAG).
  • Locale-aware PII: Luhn credit cards, IBAN check digits, 48 gov-ID formats, 68 locales, native scripts.

Install

One of:

pip install seedfaker                          # Python
npm install @opendsr/seedfaker                 # Node.js
go get github.com/opendsr-std/seedfaker-go     # Go
composer require opendsr/seedfaker             # PHP
gem install seedfaker                          # Ruby
npm install @opendsr/seedfaker-wasm            # Browser (WASM)
brew install opendsr-std/tap/seedfaker         # CLI (macOS / Linux)
cargo install seedfaker                        # CLI (from source)
npm install -g @opendsr/seedfaker-cli          # CLI (npm)

All packages wrap the same Rust core and produce byte-identical output for a given seed. See Packages and bindings for per-package documentation.

Library

One value:

from seedfaker import SeedFaker
sf = SeedFaker(seed="test")
sf.field("email")                  # "janet.marsh@inbox.com"
sf.field("phone", e164=True)       # "+14155551234"
sf.field("credit-card", space=True) # "4174 0785 8323 6433"
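
The sample card number above can be checked against the Luhn checksum named in the highlights. This is a minimal sketch in plain Python; `luhn_valid` is a hypothetical helper written for illustration, not part of the seedfaker API:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit and double every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return bool(digits) and total % 10 == 0


# Sample value copied from the snippet above.
print(luhn_valid("4174 0785 8323 6433"))  # True
```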

One record, with ctx="strict" locking every field to one identity:

sf.record(["name", "email", "phone"], ctx="strict")
# {"name": "Janet Marsh", "email": "janet.marsh@inbox.com", "phone": "+1 (957) 226-4272"}

Batch:

sf.records(["name", "email", "phone"], n=1000, ctx="strict")

Locales, weighted mix, native script:

SeedFaker(seed="test", locale="de").field("name")        # "Baldur Adler"
SeedFaker(seed="test", locale="ja").field("name")        # "石本 和彦"
SeedFaker(seed="test", locale="en=7,de=2,fr=1")          # weighted

The Node.js API mirrors the Python one, with options passed as an object:

const sf = new SeedFaker({ seed: "test", locale: "en" });
sf.records(["name", "email"], { n: 1000, ctx: "strict" });

Full API: docs/library. Locale list: docs/context.

CLI

seedfaker name email phone --seed test --until 2025 -n 1000
seedfaker name email phone --format csv --seed test --until 2025 -n 1000
seedfaker name email phone --format jsonl --seed test --until 2025 -n 1000
seedfaker name email --ctx strict -l ja,zh --abc native -n 10

Pipe directly into a database:

seedfaker name email phone --format sql=users -n 1000000 --seed staging --until 2025 | psql mydb

Arithmetic between columns:

seedfaker price=amount:1..500:plain qty=integer:1..20 "total=price*qty" \
  --format csv --seed ci -n 3 --until 2025
# price,qty,total
# 424.49,14,5942.86
# 459.67,3,1379.01
# 309.44,12,3713.28

Presets for common log/data shapes:

seedfaker run nginx   --rate 5000 --seed demo -n 0 > access.log
seedfaker run payment --format jsonl --seed bench -n 1000 --until 2025

All flags: docs/cli. Field syntax: docs/fields. Configs: docs/configs. Presets: docs/presets.

Multi-table and FK

# shop.yaml
users:
  columns:
    id: serial
    name: first-name
    email: email
  options: { count: 1000, ctx: strict }

orders:
  columns:
    id: serial
    customer_id: users.id:zipf
    customer_name: customer_id->name
    customer_email: customer_id->email
    total: amount:usd:1..5000
  options: { count: 50000 }

Generate all tables:

seedfaker run shop.yaml --all --output-dir ./data --format csv

  • users.id:zipf — FK anchor with power-law distribution. :zipf=N for tunable exponent; omit for uniform.
  • customer_id->email — FK dereference; resolves to the email of the same parent row selected by customer_id.
  • Self-referencing FK supported (employees.manager_id: employees.id).
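
To give a feel for what a power-law FK anchor does to the child table, here is a rough sketch in plain Python. It does not use seedfaker and is not its algorithm: `zipf_parent_ids`, the rank-to-id mapping (id 1 is the most popular parent), and the default exponent of 1.0 are all assumptions made for illustration.

```python
import random
from collections import Counter


def zipf_parent_ids(n_parents, n_children, exponent=1.0, seed=0):
    """Pick a parent id for each child row with weight 1 / rank**exponent."""
    rng = random.Random(seed)  # seeded, so the picks are repeatable
    ids = list(range(1, n_parents + 1))
    weights = [1.0 / (rank ** exponent) for rank in ids]
    return rng.choices(ids, weights=weights, k=n_children)


# Same shape as shop.yaml: 1000 users, 50000 orders.
customer_ids = zipf_parent_ids(1000, 50000)
counts = Counter(customer_ids)
print(counts.most_common(5))   # a few parents own most of the orders
print(len(counts), "distinct parents referenced")
```

With a uniform anchor every parent would receive roughly the same number of children; the power-law weights concentrate them on a small head, which is closer to real customer/order data.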

Details: docs/multi-table, docs/expressions.

For bulk-loading a real database at GB/TB scale see guides/seed-large-database.

Distributed generation

Determinism enables horizontal scale without coordination. --shard I/N emits a disjoint, contiguous slice of the full serial range; the same seed on different hosts produces non-overlapping output. Concatenating all N shards (first shard's header retained, rest with --no-header) yields bytes bit-identical to an N=1 run.

Three hosts, one dataset (the commands in this section assume the config also defines an events table):

# host-a
seedfaker run shop.yaml --table events --seed prod -n 1_000_000_000 \
  --shard 0/3 --format csv > events.part0.csv

# host-b
seedfaker run shop.yaml --table events --seed prod -n 1_000_000_000 \
  --shard 1/3 --format csv --no-header > events.part1.csv

# host-c
seedfaker run shop.yaml --table events --seed prod -n 1_000_000_000 \
  --shard 2/3 --format csv --no-header > events.part2.csv

Collect and concatenate:

cat events.part0.csv events.part1.csv events.part2.csv > events.csv
# Same bytes, same SHA-256 as:
seedfaker run shop.yaml --table events --seed prod -n 1_000_000_000 --format csv

No shared state between hosts. No coordinator. No post-processing merge step. Each host is CPU-bound on its own slice and finishes independently.
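
The property behind this can be sketched in a few lines of plain Python. The sketch is illustrative only: `shard_range` and `generate` are hypothetical stand-ins, the way the remainder is spread over shards is an assumption, and SHA-256 replaces whatever seedfaker uses internally. The point is that when each row depends only on the seed and its row number, contiguous slices concatenate to the single-run output.

```python
import hashlib


def shard_range(total: int, index: int, count: int) -> tuple[int, int]:
    """Half-open [start, end) slice of rows 0..total for shard index/count."""
    base, extra = divmod(total, count)
    # The first `extra` shards take one additional row each.
    start = index * base + min(index, extra)
    end = start + base + (1 if index < extra else 0)
    return start, end


def generate(seed: str, start: int, end: int) -> list[str]:
    """Stand-in generator: each row is a pure function of (seed, row number)."""
    rows = []
    for n in range(start, end):
        token = hashlib.sha256(f"{seed}:{n}".encode()).hexdigest()[:8]
        rows.append(f"{n},{token}")
    return rows


total, shards = 10, 3
parts = [generate("prod", *shard_range(total, i, shards)) for i in range(shards)]
merged = [row for part in parts for row in part]
print(merged == generate("prod", 0, total))  # True
```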

Per-host generation can also use --threads N on top of --shard, stacking process and in-process parallelism:

seedfaker ... --shard 0/3 --threads 8 --format csv > events.part0.csv

Details on which mechanism to pick and how they compose: docs/cli § Sharding and threads, guides/seed-large-database.

Bulk load into a database

Pipe generated CSV straight into COPY FROM STDIN — no intermediate files, constant memory:

seedfaker run shop.yaml --table users --format csv \
  | psql "$PGURL" -q -c "\COPY users (id,name,email) FROM STDIN WITH (FORMAT csv, HEADER true)"

For GB/TB-scale loads, load into a bare table first (phase 1: no primary key, no foreign keys, no indexes), then add the constraints back once the data is in (phase 2):

CREATE UNLOGGED TABLE users (id UUID NOT NULL, name TEXT, email TEXT);
-- load rows with COPY FROM STDIN (no PK, no FK, no indexes)
ALTER TABLE users SET LOGGED;
ALTER TABLE users ADD PRIMARY KEY (id);

Reason: Postgres constraint and index maintenance is per-row during INSERT/COPY; deferring to a single post-load scan is dramatically faster. seedfaker guarantees id uniqueness by construction, so phase-1 validation is wasted work.

--shard I/N splits one table's generation into N disjoint serial ranges. Run multiple seedfaker | psql pipelines in parallel into the same table: COPY FROM takes a ROW EXCLUSIVE table lock, which does not conflict with itself, so concurrent COPY into one table is supported.

# 4 shards into the same table, concurrent
for i in 0 1 2 3; do
  seedfaker run shop.yaml --table events --format csv --shard $i/4 \
    | psql "$PGURL" -q -c "\COPY events (id,ts,user_id) FROM STDIN WITH (FORMAT csv, HEADER true)" &
done
wait

The reference benchmark benchmarks/payments_5gb.sh implements this pattern end-to-end: 10-table payment dataset, Dockerised Postgres 17 with tuned settings, per-table shard pool, Postgres-side WAL / checkpoint / cache-hit counters.

./benchmarks/payments_5gb.sh                       # ~100 MB, default
./benchmarks/payments_5gb.sh --scale 50 --shards 3 # ~5 GB with 3-way sharding of the big tables
./benchmarks/payments_5gb.sh --cleanup

Full workflow, tuning rationale, per-knob cost table, cross-engine notes (MySQL, ClickHouse, SQLite): guides/seed-large-database.

Anonymise existing data

Replace specific columns in existing CSV or JSONL, keeping other columns untouched and preserving referential integrity across files:

$ echo 'name,email,ssn
Alice,alice@corp.com,123-45-6789' | seedfaker replace email ssn --seed anon
name,email,ssn
Alice,nolan.moreno.xxy@icloud.com,404-16-7659

Same value + same seed yields the same replacement every run, so joining users.email and events.email (after masking each independently) still matches. Details: docs/replace.
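
Why independent masking keeps joins intact can be shown with a small stand-in. This sketch assumes nothing about how seedfaker derives replacements: `pseudonymise_email` is a hypothetical function that uses HMAC-SHA256 keyed by the seed, and the `example.com` output format is invented for the demo.

```python
import hashlib
import hmac


def pseudonymise_email(value: str, seed: str) -> str:
    """Map an email to a stable fake one; same (value, seed) gives the same result."""
    digest = hmac.new(seed.encode(), value.encode(), hashlib.sha256).hexdigest()
    return f"user.{digest[:12]}@example.com"


# Two "files" masked independently with the same seed.
users = ["alice@corp.com", "bob@corp.com"]
events = ["bob@corp.com", "alice@corp.com", "bob@corp.com"]

masked_users = {pseudonymise_email(e, "anon") for e in users}
masked_events = {pseudonymise_email(e, "anon") for e in events}

# Every masked event email still has a matching masked user.
print(masked_events <= masked_users)  # True
```

Changing the seed changes every replacement, so files that need to join must be masked with the same seed.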

Annotated output for ML

--annotated emits JSONL with byte-offset spans, suitable for NER / PII training sets:

$ seedfaker name email ssn --annotated --seed demo -n 1 --until 2025
{"text":"Paulina Laca\tim.ivana@eunet.rs\t9580255797203","spans":[{"s":0,"e":12,"f":"name","v":"Paulina Laca"},{"s":13,"e":30,"f":"email","v":"im.ivana@eunet.rs"},{"s":31,"e":44,"f":"ssn","v":"9580255797203"}]}

Combine with --corrupt low|mid|high|extreme for noisy training data. Details: docs/annotated, docs/corruption.
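
A consumer of this format can sanity-check the spans before training on them. The sketch below parses the sample record shown above and confirms that each span's s/e pair, read as half-open byte offsets into the UTF-8 text (an interpretation consistent with the sample, but still an assumption), recovers the value in v. `check_spans` is a hypothetical helper, not part of seedfaker.

```python
import json

# Sample line copied from the --annotated output above (raw string keeps the JSON \t escapes).
line = r'{"text":"Paulina Laca\tim.ivana@eunet.rs\t9580255797203","spans":[{"s":0,"e":12,"f":"name","v":"Paulina Laca"},{"s":13,"e":30,"f":"email","v":"im.ivana@eunet.rs"},{"s":31,"e":44,"f":"ssn","v":"9580255797203"}]}'


def check_spans(record: dict) -> dict:
    """For each span, report whether bytes [s, e) of the text equal the value v."""
    raw = record["text"].encode("utf-8")
    return {
        span["f"]: raw[span["s"]:span["e"]].decode("utf-8") == span["v"]
        for span in record["spans"]
    }


print(check_spans(json.loads(line)))
# {'name': True, 'email': True, 'ssn': True}
```

Slicing the encoded bytes rather than the string matters for non-ASCII locales, where character and byte offsets diverge.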

Determinism

Each value is derived from (seed, record_number, field_name). Consequences:

  • Adding a field does not change values of existing fields.
  • Reordering fields in the config does not change values.
  • The same seed produces byte-identical output across languages and versions within the same algorithm fingerprint.
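
The first two consequences follow directly from keying each value on its own (seed, record_number, field_name) triple. Here is a minimal sketch of that idea in plain Python; `derive` and `record` are hypothetical, and SHA-256 is a stand-in for seedfaker's actual derivation, which is not described here.

```python
import hashlib


def derive(seed: str, record_number: int, field_name: str) -> str:
    """Stand-in value: a pure function of (seed, record_number, field_name)."""
    key = f"{seed}\x00{record_number}\x00{field_name}".encode()
    return hashlib.sha256(key).hexdigest()[:8]


def record(seed: str, record_number: int, fields: list[str]) -> dict:
    """Build one record; no field's value depends on the other fields requested."""
    return {name: derive(seed, record_number, name) for name in fields}


before = record("test", 0, ["name", "email"])
after = record("test", 0, ["phone", "email", "name"])  # field added, order changed

print(before["email"] == after["email"])  # True
print(before["name"] == after["name"])    # True
```

A generator that instead pulled values from one shared random stream would shift every later value whenever a field was added or moved.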

Pin the fingerprint in CI to detect algorithm changes:

seedfaker --fingerprint
# sf0-158dc9f79ce46b43

Details: docs/determinism, docs/context (identity correlation).

Packages and bindings

| Language / runtime | Package | Local docs |
| --- | --- | --- |
| Python | pip install seedfaker | packages/pip |
| Node.js | npm install @opendsr/seedfaker | packages/npm |
| Go | go get github.com/opendsr-std/seedfaker-go | packages/go |
| PHP | composer require opendsr/seedfaker | packages/php |
| Ruby | gem install seedfaker | packages/ruby |
| Browser (WASM) | npm install @opendsr/seedfaker-wasm | packages/wasm |
| CLI (npm) | npm install -g @opendsr/seedfaker-cli | packages/npm-cli |
| CLI (native) | brew install opendsr-std/tap/seedfaker or cargo install seedfaker | docs/cli |

All packages wrap the same Rust core. API surface is intentionally identical across languages except for idiomatic naming.

Documentation

Reference: docs/.

Start here: Quick start
CLI: Commands and flags · Determinism
Fields: Syntax and modifiers · Field reference (200+)
Configs: YAML configs · Multi-table · Expressions
Output: Templates · Annotated · Streaming
Data quality: Context · Corruption · Replace
Presets: Built-in presets (nginx, payment, auth, postgres, syslog, medical, …)
Integrations: Library API · MCP

Workflows: guides/. Runnable examples: examples/.

Quick start

pip install seedfaker
python -c 'from seedfaker import SeedFaker; print(SeedFaker(seed="demo").record(["name","email"]))'

Or with the CLI:

brew install opendsr-std/tap/seedfaker
seedfaker name email phone --seed demo --until 2025 -n 5

Then: docs/quick-start for the 10-minute walkthrough, docs/cli for flags, docs/fields for field syntax.

Guides

End-to-end workflows in guides/:

Seed a database: Postgres/MySQL staging DB with multi-table FK
Seed a large database: GB/TB bulk load (parallel COPY, UNLOGGED, tuning)
Distributed generation: Multi-host sharded generation without coordination
Anonymise production data: replace on CSV/JSONL, FK integrity across files
Training and evaluation datasets: NER/PII, LLM fine-tuning, eval with ground truth, red-team, multilingual
Reproducible datasets: Deterministic fixtures, CI, fingerprint guard
Library usage: Python / Node.js SDK patterns
Mock API server: Express / FastAPI mock endpoint
API load testing: Rate-limited streaming, corruption
MCP for AI agents: Claude / Cursor / VS Code integration

Benchmarks

Reproducible throughput measurements, install scripts, per-field breakdowns, and an end-to-end Postgres load benchmark (payments_5gb.sh): benchmarks/.

License

MIT

