seedfaker

Deterministic synthetic data generator. Same seed, same output — across CLI, Python, Node.js, Go, PHP, Ruby, WASM.

200+ fields, 68 locales, multi-table FK, expressions, templates, streaming, replace for anonymising existing data.

Highlights

Deterministic across 7 runtimes: CLI, Python, Node, Go, PHP, Ruby, WASM. Same seed → byte-identical output. --fingerprint catches algorithm drift.
Multi-table FK: anchors (users.id:zipf), dereference (customer_id->email), self-reference, ctx:strict identity correlation.
Distributed: --shard I/N on three hosts, concatenate, byte-identical to a single-host run. No coordinator.
Database ingest: seedfaker | psql "\COPY", no files, constant memory.
TB-scale: 1 GB into Postgres in 9 s on 8-core (benchmark); 1 TB ≈ 4.3 h.
Throughput: ~90 MB/s per core (TPC-H dbgen parity), 403 MB/s on 8 threads. Reproducible in benchmarks/.
In-place anonymisation: seedfaker replace email ssn < dump.csv. Same value + seed = same replacement; cross-file joins survive.
ML/LLM datasets: --annotated (byte-offset spans), --corrupt (15 noise types), templates (prompt/completion), multi-table FK (conversations, RAG).
Locale-aware PII: Luhn credit cards, IBAN check digits, 48 gov-ID formats, 68 locales, native scripts.

Install

One of:

pip install seedfaker                          # Python
npm install @opendsr/seedfaker                 # Node.js
go get github.com/opendsr-std/seedfaker-go     # Go
composer require opendsr/seedfaker             # PHP
gem install seedfaker                          # Ruby
npm install @opendsr/seedfaker-wasm            # Browser (WASM)
brew install opendsr-std/tap/seedfaker         # CLI (macOS / Linux)
cargo install seedfaker                        # CLI (from source)
npm install -g @opendsr/seedfaker-cli          # CLI (npm)

All packages wrap the same Rust core and produce byte-identical output for a given seed. See Packages and bindings for per-package documentation.

Library

One value:

from seedfaker import SeedFaker
sf = SeedFaker(seed="test")
sf.field("email")                  # "janet.marsh@inbox.com"
sf.field("phone", e164=True)       # "+14155551234"
sf.field("credit-card", space=True) # "4174 0785 8323 6433"
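
The credit-card sample above can be checked independently of seedfaker. The following is a minimal sketch of the standard Luhn checksum; the helper name luhn_valid is hypothetical and not part of the seedfaker API.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


# Sample value copied from the example above (spaces are ignored).
print(luhn_valid("4174 0785 8323 6433"))
```

A check like this is handy in tests that assert generated card numbers stay structurally valid.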

One record, with ctx="strict" locking every field to one identity:

sf.record(["name", "email", "phone"], ctx="strict")
# {"name": "Janet Marsh", "email": "janet.marsh@inbox.com", "phone": "+1 (957) 226-4272"}

Batch:

sf.records(["name", "email", "phone"], n=1000, ctx="strict")

Locales, weighted mix, native script:

SeedFaker(seed="test", locale="de").field("name")        # "Baldur Adler"
SeedFaker(seed="test", locale="ja").field("name")        # "石本 和彦"
SeedFaker(seed="test", locale="en=7,de=2,fr=1")          # weighted
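
To illustrate what a weighted locale string such as "en=7,de=2,fr=1" expresses, here is a rough sketch that parses the spec and draws locales in proportion to the weights. It is an assumption about the semantics, not seedfaker's parser or sampler; parse_locale_mix and pick_locales are hypothetical names, and a plain list such as "ja,zh" is assumed to mean equal weights.

```python
import random


def parse_locale_mix(spec: str) -> dict[str, int]:
    """Turn 'en=7,de=2,fr=1' into {'en': 7, 'de': 2, 'fr': 1}."""
    weights: dict[str, int] = {}
    for part in spec.split(","):
        code, _, weight = part.strip().partition("=")
        # Assumption: a locale without '=N' gets weight 1.
        weights[code] = int(weight) if weight else 1
    return weights


def pick_locales(spec: str, n: int, seed: str) -> list[str]:
    """Draw n locale codes in proportion to their weights, reproducibly."""
    weights = parse_locale_mix(spec)
    rng = random.Random(seed)  # seeded, so the same inputs give the same draw
    return rng.choices(list(weights), weights=list(weights.values()), k=n)


print(parse_locale_mix("en=7,de=2,fr=1"))
print(pick_locales("en=7,de=2,fr=1", 10, "test"))
```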

The Node.js API mirrors the Python one:

const sf = new SeedFaker({ seed: "test", locale: "en" });
sf.records(["name", "email"], { n: 1000, ctx: "strict" });

Full API: docs/library. Locale list: docs/context.

CLI

seedfaker name email phone --seed test --until 2025 -n 1000
seedfaker name email phone --format csv --seed test --until 2025 -n 1000
seedfaker name email phone --format jsonl --seed test --until 2025 -n 1000
seedfaker name email --ctx strict -l ja,zh --abc native -n 10

Pipe directly into a database:

seedfaker name email phone --format sql=users -n 1000000 --seed staging --until 2025 | psql mydb

Arithmetic between columns:

seedfaker price=amount:1..500:plain qty=integer:1..20 "total=price*qty" \
  --format csv --seed ci -n 3 --until 2025
# price,qty,total
# 424.49,14,5942.86
# 459.67,3,1379.01
# 309.44,12,3713.28
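
A downstream test can confirm that the derived column really equals price*qty. The sketch below re-checks the three sample rows above with exact decimal arithmetic; check_totals is a hypothetical helper, and the CSV text is copied from the output shown.

```python
import csv
import io
from decimal import Decimal

SAMPLE = """price,qty,total
424.49,14,5942.86
459.67,3,1379.01
309.44,12,3713.28
"""


def check_totals(csv_text: str) -> list[dict]:
    """Return the rows whose total is not exactly price * qty."""
    mismatches = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Decimal avoids binary floating-point rounding on currency values.
        expected = Decimal(row["price"]) * int(row["qty"])
        if expected != Decimal(row["total"]):
            mismatches.append(row)
    return mismatches


print(check_totals(SAMPLE))  # an empty list means every row is consistent
```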

Presets for common log/data shapes:

seedfaker run nginx   --rate 5000 --seed demo -n 0 > access.log
seedfaker run payment --format jsonl --seed bench -n 1000 --until 2025

All flags: docs/cli. Field syntax: docs/fields. Configs: docs/configs. Presets: docs/presets.

Multi-table and FK

# shop.yaml
users:
  columns:
    id: serial
    name: first-name
    email: email
  options: { count: 1000, ctx: strict }

orders:
  columns:
    id: serial
    customer_id: users.id:zipf
    customer_name: customer_id->name
    customer_email: customer_id->email
    total: amount:usd:1..5000
  options: { count: 50000 }

seedfaker run shop.yaml --all --output-dir ./data --format csv

users.id:zipf — FK anchor with power-law distribution. :zipf=N for tunable exponent; omit for uniform.
customer_id->email — FK dereference; resolves to the email of the same parent row selected by customer_id.
Self-referencing FK supported (employees.manager_id: employees.id).
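
The two FK mechanisms above can be pictured with a small stand-alone model. This is a sketch under stated assumptions, not seedfaker's implementation: the weighting 1/rank^exponent, the placeholder user values, and the names zipf_weights and build_tables are all illustrative.

```python
import random


def zipf_weights(n: int, exponent: float = 1.0) -> list[float]:
    """Power-law weights: rank 1 is picked most often, rank n least."""
    return [1.0 / (rank ** exponent) for rank in range(1, n + 1)]


def build_tables(user_count: int, order_count: int, seed: str):
    rng = random.Random(seed)
    # Parent table: placeholder values keyed by a serial id.
    users = {
        uid: {"name": f"user{uid}", "email": f"user{uid}@example.com"}
        for uid in range(1, user_count + 1)
    }
    ids = list(users)
    # FK anchor: choose a parent id for each order with power-law skew.
    picked = rng.choices(ids, weights=zipf_weights(len(ids)), k=order_count)
    orders = []
    for order_id, customer_id in enumerate(picked, start=1):
        parent = users[customer_id]  # FK dereference: same parent row
        orders.append({
            "id": order_id,
            "customer_id": customer_id,
            "customer_name": parent["name"],
            "customer_email": parent["email"],
        })
    return users, orders


users, orders = build_tables(100, 5000, "demo")
print(orders[0])
```

Because customer_name and customer_email are both read from the row that customer_id selected, the child columns can never disagree with the parent table.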

Details: docs/multi-table, docs/expressions.

For bulk-loading a real database at GB/TB scale see guides/seed-large-database.

Distributed generation

Determinism enables horizontal scale without coordination. --shard I/N emits a disjoint, contiguous slice of the full serial range, so the same seed on different hosts produces non-overlapping output. Concatenating all N shards in order (first shard's header retained, the rest run with --no-header) yields output byte-identical to a single unsharded run.

Three hosts, one dataset:

# host-a
seedfaker run shop.yaml --table events --seed prod -n 1_000_000_000 \
  --shard 0/3 --format csv > events.part0.csv

# host-b
seedfaker run shop.yaml --table events --seed prod -n 1_000_000_000 \
  --shard 1/3 --format csv --no-header > events.part1.csv

# host-c
seedfaker run shop.yaml --table events --seed prod -n 1_000_000_000 \
  --shard 2/3 --format csv --no-header > events.part2.csv

Collect and concatenate:

cat events.part0.csv events.part1.csv events.part2.csv > events.csv
# Same bytes, same SHA-256 as:
seedfaker run shop.yaml --table events --seed prod -n 1_000_000_000 --format csv

No shared state between hosts. No coordinator. No post-processing merge step. Each host is CPU-bound on its own slice and finishes independently.
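
Why concatenation reproduces the single-host output can be shown with a toy model: if each row depends only on the seed and its row number, contiguous slices joined in order are the full sequence. The sketch below assumes one possible way of splitting the range; how seedfaker distributes a remainder across shards is not stated here, and shard_range and generate are hypothetical names.

```python
def shard_range(index: int, shards: int, total: int) -> tuple[int, int]:
    """Half-open [start, stop) row range for shard `index` of `shards`."""
    base, extra = divmod(total, shards)
    # Assumption: the first `extra` shards each take one additional row.
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop


def generate(seed: str, start: int, stop: int) -> list[str]:
    """Stand-in generator: each row is a function of (seed, row number)."""
    return [f"{seed}:{n}" for n in range(start, stop)]


total, shards = 10, 3
parts = [generate("prod", *shard_range(i, shards, total)) for i in range(shards)]
combined = [row for part in parts for row in part]
print(combined == generate("prod", 0, total))
```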

Per-host generation can also use --threads N on top of --shard, stacking process and in-process parallelism:

seedfaker ... --shard 0/3 --threads 8 --format csv > events.part0.csv

Details on which mechanism to pick and how they compose: docs/cli § Sharding and threads, guides/seed-large-database.

Bulk load into a database

Pipe generated CSV straight into COPY FROM STDIN — no intermediate files, constant memory:

seedfaker run shop.yaml --table users --format csv \
  | psql "$PGURL" -q -c "\COPY users (id,name,email) FROM STDIN WITH (FORMAT csv, HEADER true)"

For GB/TB-scale loads, create the table without a primary key, foreign keys, or indexes for the load (phase 1), then add them afterwards (phase 2).

CREATE UNLOGGED TABLE users (id BIGINT NOT NULL, name TEXT, email TEXT);
-- phase 1: load rows with COPY FROM STDIN (no PK, no FK, no indexes)
-- phase 2: make the table durable, then build the key
ALTER TABLE users SET LOGGED;
ALTER TABLE users ADD PRIMARY KEY (id);

Reason: Postgres maintains indexes and checks constraints row by row during INSERT/COPY; building them once in a single post-load pass is much faster. seedfaker guarantees id uniqueness by construction, so checking uniqueness during phase 1 is wasted work.

--shard I/N splits one table's generation into N disjoint serial ranges. Run multiple seedfaker | psql pipelines in parallel into the same table: COPY takes a ROW EXCLUSIVE table lock, which does not conflict with itself, so concurrent COPY into one table is supported.

# 4 shards into the same table, concurrent
for i in 0 1 2 3; do
  seedfaker run shop.yaml --table events --format csv --shard $i/4 \
    | psql "$PGURL" -q -c "\COPY events (id,ts,user_id) FROM STDIN WITH (FORMAT csv, HEADER true)" &
done
wait
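
A bare wait in the shell loop above discards the exit status of the background pipelines. If a failed shard needs to be detected, the same pattern can be driven from Python. This is a sketch only: it reuses the exact commands shown above, and shard_commands and load_all are hypothetical helper names.

```python
import subprocess

COPY_SQL = r"\COPY events (id,ts,user_id) FROM STDIN WITH (FORMAT csv, HEADER true)"


def shard_commands(index: int, shards: int, pg_url: str):
    """Build the generator and loader argv lists for one shard."""
    generate = ["seedfaker", "run", "shop.yaml", "--table", "events",
                "--format", "csv", "--shard", f"{index}/{shards}"]
    load = ["psql", pg_url, "-q", "-c", COPY_SQL]
    return generate, load


def load_all(shards: int, pg_url: str) -> list[tuple[int, int]]:
    """Start every shard pipeline, then collect (generator, psql) exit codes."""
    pipelines = []
    for index in range(shards):
        generate, load = shard_commands(index, shards, pg_url)
        producer = subprocess.Popen(generate, stdout=subprocess.PIPE)
        consumer = subprocess.Popen(load, stdin=producer.stdout)
        producer.stdout.close()  # psql now holds the only read end of the pipe
        pipelines.append((producer, consumer))
    return [(p.wait(), c.wait()) for p, c in pipelines]
```

Calling load_all(4, pg_url) with a real connection string and checking that every returned pair is (0, 0) gives a pass/fail signal that the shell loop does not.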

The reference benchmark benchmarks/payments_5gb.sh implements this pattern end-to-end: 10-table payment dataset, Dockerised Postgres 17 with tuned settings, per-table shard pool, Postgres-side WAL / checkpoint / cache-hit counters.

./benchmarks/payments_5gb.sh                       # ~100 MB, default
./benchmarks/payments_5gb.sh --scale 50 --shards 3 # ~5 GB with 3-way sharding of the big tables
./benchmarks/payments_5gb.sh --cleanup

Full workflow, tuning rationale, per-knob cost table, cross-engine notes (MySQL, ClickHouse, SQLite): guides/seed-large-database.

Anonymise existing data

Replace specific columns in existing CSV or JSONL, keeping other columns untouched and preserving referential integrity across files:

$ echo 'name,email,ssn
Alice,alice@corp.com,123-45-6789' | seedfaker replace email ssn --seed anon
name,email,ssn
Alice,nolan.moreno.xxy@icloud.com,404-16-7659

Same value + same seed yields the same replacement every run, so joining users.email and events.email (after masking each independently) still matches. Details: docs/replace.
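
The join-preserving property follows from making the replacement a pure function of (seed, original value). The sketch below demonstrates the idea with a keyed hash; it is not seedfaker's replacement algorithm, and pseudonym_email and its output format are hypothetical.

```python
import hashlib
import hmac


def pseudonym_email(value: str, seed: str) -> str:
    """Map an email to a stable fake address derived from (seed, value)."""
    digest = hmac.new(seed.encode(), value.encode(), hashlib.sha256).hexdigest()
    return f"user-{digest[:12]}@example.com"


# Two files masked independently with the same seed.
users = [{"email": "alice@corp.com"}, {"email": "bob@corp.com"}]
events = [{"email": "bob@corp.com"}, {"email": "alice@corp.com"}]

masked_users = {pseudonym_email(u["email"], "anon") for u in users}
masked_events = {pseudonym_email(e["email"], "anon") for e in events}

# Every masked event email still has a matching masked user email.
print(masked_events <= masked_users)
```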

Annotated output for ML

--annotated emits JSONL with byte-offset spans, suitable for NER / PII training sets:

$ seedfaker name email ssn --annotated --seed demo -n 1 --until 2025
{"text":"Paulina Laca\tim.ivana@eunet.rs\t9580255797203","spans":[{"s":0,"e":12,"f":"name","v":"Paulina Laca"},{"s":13,"e":30,"f":"email","v":"im.ivana@eunet.rs"},{"s":31,"e":44,"f":"ssn","v":"9580255797203"}]}

Combine with --corrupt low|mid|high|extreme for noisy training data. Details: docs/annotated, docs/corruption.
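
Before training on annotated output it is worth verifying that each span slices back to its recorded value. The sketch below does this for the sample line above, assuming s and e are start and exclusive-end byte offsets into the UTF-8 text, f is the field name, and v is the value; check_spans is a hypothetical helper.

```python
import json

# Sample line copied from the --annotated output above (raw string keeps \t as JSON escapes).
LINE = (
    r'{"text":"Paulina Laca\tim.ivana@eunet.rs\t9580255797203",'
    r'"spans":[{"s":0,"e":12,"f":"name","v":"Paulina Laca"},'
    r'{"s":13,"e":30,"f":"email","v":"im.ivana@eunet.rs"},'
    r'{"s":31,"e":44,"f":"ssn","v":"9580255797203"}]}'
)


def check_spans(line: str) -> list[dict]:
    """Return the spans whose byte slice does not equal the recorded value."""
    record = json.loads(line)
    raw = record["text"].encode("utf-8")  # offsets are assumed to be in bytes
    bad = []
    for span in record["spans"]:
        piece = raw[span["s"]:span["e"]].decode("utf-8", errors="replace")
        if piece != span["v"]:
            bad.append(span)
    return bad


print(check_spans(LINE))  # an empty list means all spans line up
```

Slicing the encoded bytes rather than the Python string matters once values contain non-ASCII characters, where byte and character offsets diverge.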

Determinism

Each value is derived from (seed, record_number, field_name). Consequences:

Adding a field does not change values of existing fields.
Reordering fields in the config does not change values.
The same seed produces byte-identical output across languages and versions within the same algorithm fingerprint.
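
The first two consequences can be seen in a toy model where each value is a hash of (seed, record_number, field_name) and nothing else. This is an illustration of the principle only; the hash choice and the names derive and record are assumptions, not seedfaker's algorithm.

```python
import hashlib


def derive(seed: str, record_number: int, field_name: str) -> int:
    """Derive a 64-bit value from the (seed, record_number, field_name) triple."""
    key = f"{seed}\x00{record_number}\x00{field_name}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def record(seed: str, record_number: int, fields: list[str]) -> dict[str, int]:
    """Build one record; each field's value ignores the other fields."""
    return {name: derive(seed, record_number, name) for name in fields}


a = record("test", 0, ["name", "email"])
b = record("test", 0, ["phone", "email", "name"])  # added a field and reordered

print(a["email"] == b["email"], a["name"] == b["name"])
```

A generator that instead pulled values from one sequential random stream would shift every later value as soon as a field was added.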

Pin the fingerprint in CI to detect algorithm changes:

seedfaker --fingerprint
# sf0-158dc9f79ce46b43
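
One way to wire this into CI is a small guard script that compares the command's output with a pinned string. The sketch below assumes the pinned value is stored in the script itself; current_fingerprint and fingerprint_ok are hypothetical helper names, and only the --fingerprint flag shown above is used.

```python
import subprocess

PINNED = "sf0-158dc9f79ce46b43"  # value from the example above


def current_fingerprint() -> str:
    """Ask the installed seedfaker CLI for its algorithm fingerprint."""
    result = subprocess.run(
        ["seedfaker", "--fingerprint"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def fingerprint_ok(actual: str, pinned: str = PINNED) -> bool:
    """True when the reported fingerprint matches the pinned one."""
    return actual.strip() == pinned
```

In a CI job, calling fingerprint_ok(current_fingerprint()) and failing the build on False would flag an algorithm change before any fixture is regenerated.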

Details: docs/determinism, docs/context (identity correlation).

Packages and bindings

| Language / runtime | Package | Local docs |
| --- | --- | --- |
| Python | `pip install seedfaker` | packages/pip |
| Node.js | `npm install @opendsr/seedfaker` | packages/npm |
| Go | `go get github.com/opendsr-std/seedfaker-go` | packages/go |
| PHP | `composer require opendsr/seedfaker` | packages/php |
| Ruby | `gem install seedfaker` | packages/ruby |
| Browser (WASM) | `npm install @opendsr/seedfaker-wasm` | packages/wasm |
| CLI (npm) | `npm install -g @opendsr/seedfaker-cli` | packages/npm-cli |
| CLI (native) | `brew install opendsr-std/tap/seedfaker` or `cargo install seedfaker` | docs/cli |

All packages wrap the same Rust core. API surface is intentionally identical across languages except for idiomatic naming.

Documentation

Reference: docs/.


| Area | Pages |
| --- | --- |
| Start here | Quick start |
| CLI | Commands and flags · Determinism |
| Fields | Syntax and modifiers · Field reference (200+) |
| Configs | YAML configs · Multi-table · Expressions |
| Output | Templates · Annotated · Streaming |
| Data quality | Context · Corruption · Replace |
| Presets | Built-in presets (nginx, payment, auth, postgres, syslog, medical, …) |
| Integrations | Library API · MCP |

Workflows: guides/. Runnable examples: examples/.

Quick start

pip install seedfaker
python -c 'from seedfaker import SeedFaker; print(SeedFaker(seed="demo").record(["name","email"]))'

Or with the CLI:

brew install opendsr-std/tap/seedfaker
seedfaker name email phone --seed demo --until 2025 -n 5

Then: docs/quick-start for the 10-minute walkthrough, docs/cli for flags, docs/fields for field syntax.

Guides

End-to-end workflows in guides/:


| Guide | Covers |
| --- | --- |
| Seed a database | Postgres/MySQL staging DB with multi-table FK |
| Seed a large database | GB/TB bulk load: parallel COPY, UNLOGGED, tuning |
| Distributed generation | Multi-host sharded generation without coordination |
| Anonymise production data | `replace` on CSV/JSONL, FK integrity across files |
| Training and evaluation datasets | NER/PII, LLM fine-tuning, eval with ground truth, red-team, multilingual |
| Reproducible datasets | Deterministic fixtures, CI, fingerprint guard |
| Library usage | Python / Node.js SDK patterns |
| Mock API server | Express / FastAPI mock endpoint |
| API load testing | Rate-limited streaming, corruption |
| MCP for AI agents | Claude / Cursor / VS Code integration |

Benchmarks

Reproducible throughput measurements, install scripts, per-field breakdowns, and an end-to-end Postgres load benchmark (payments_5gb.sh): benchmarks/.

License

MIT

