High-Quality Procedural Seed Data for Development Environment
Overview
Introduce a procedural seed data generation system for FinishLine's development environment. The goal is to replace the existing single-file, manually defined seed data with a scalable, distribution-aware generation infrastructure that produces realistic, temporally consistent data across multiple cars and years. This will allow the team to stress-test features before production, surface frontend and backend bugs earlier, and support development of multi-year and multi-car aggregation features.
Stakeholders
Software Stakeholders: @chpy04, @wavehassman
Reference Users: @chpy04, FinishLine developers
User Story
As a FinishLine developer, I want a realistic, large-scale seed dataset that is procedurally generated with configurable distributions, so that I can catch UI, performance, and logic bugs in the development environment before they reach production, and build multi-year features against representative data.
As a FinishLine developer, I want a consistent infrastructure & established patterns for creating procedurally generated seed data with custom distributions, so that I can create high-quality and easily modifiable seed data for features I am developing.
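One possible shape for the "configurable distributions" pattern is sketched below. Everything in it is hypothetical: the `Distribution` type, the `sample` and `mulberry32` helpers, and the config field names and numbers are illustrative placeholders, not existing FinishLine code. The idea is that each generator reads its counts from a declarative config and draws from a seeded random source, so a given seed always reproduces the same dataset.

```typescript
// Illustrative sketch only: all names and numbers below are assumptions,
// not part of the existing FinishLine codebase.

// A random source returning a float in [0, 1).
type Rng = () => number;

type Distribution =
  | { kind: 'constant'; value: number }
  | { kind: 'uniformInt'; min: number; max: number } // both bounds inclusive
  | { kind: 'weighted'; options: { value: number; weight: number }[] };

// Small deterministic PRNG (mulberry32-style): same seed, same sequence.
function mulberry32(seed: number): Rng {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Draw one value from a distribution using the given random source.
function sample(dist: Distribution, rng: Rng): number {
  switch (dist.kind) {
    case 'constant':
      return dist.value;
    case 'uniformInt':
      return dist.min + Math.floor(rng() * (dist.max - dist.min + 1));
    case 'weighted': {
      if (dist.options.length === 0) throw new Error('weighted distribution needs options');
      const total = dist.options.reduce((sum, o) => sum + o.weight, 0);
      let roll = rng() * total;
      for (const option of dist.options) {
        roll -= option.weight;
        if (roll < 0) return option.value;
      }
      // Floating-point fallback: return the last option.
      return dist.options[dist.options.length - 1].value;
    }
  }
}

// Placeholder config: the fields and numbers are made up for illustration.
interface SeedConfig {
  workPackagesPerProject: Distribution;
  tasksPerWorkPackage: Distribution;
}

const exampleConfig: SeedConfig = {
  workPackagesPerProject: { kind: 'uniformInt', min: 2, max: 6 },
  tasksPerWorkPackage: {
    kind: 'weighted',
    options: [
      { value: 0, weight: 1 },
      { value: 5, weight: 3 },
      { value: 20, weight: 1 }
    ]
  }
};

const rng = mulberry32(42);
const workPackageCount = sample(exampleConfig.workPackagesPerProject, rng);
```

Keeping the distributions in plain data means a developer could adjust volumes for a feature by editing the config rather than the generator logic.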
Success Metrics
- Seed data generation completes in under 1 minute
- Developers no longer need to pull from the production database to stress-test features
- Frontend scroll/pagination bugs are catchable in dev due to realistic data volumes
- Backend query performance issues (e.g., over-fetching) are surfaced before production
- Multi-year and multi-car aggregation features can be developed and tested entirely in dev
- All generated data is in a valid application state — no impossible object configurations exist in the seed
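The "valid application state" metric could be checked mechanically by running the generated objects through a list of invariants. The sketch below is one way to do that, under stated assumptions: the `SeedTask` shape, its status values, and both rules are hypothetical examples, not the real FinishLine models or business rules.

```typescript
// Illustrative sketch only: the task shape and the rules are assumptions.
interface SeedTask {
  id: string;
  status: 'TODO' | 'IN_PROGRESS' | 'DONE';
  createdAt: Date;
  completedAt: Date | null;
}

// A named rule that every generated item of type T must satisfy.
interface Invariant<T> {
  description: string;
  holds: (item: T) => boolean;
}

const taskInvariants: Invariant<SeedTask>[] = [
  {
    description: 'a task has a completion date if and only if it is DONE',
    holds: (t) => (t.status === 'DONE') === (t.completedAt !== null)
  },
  {
    description: 'a task is never completed before it is created',
    holds: (t) => t.completedAt === null || t.completedAt.getTime() >= t.createdAt.getTime()
  }
];

// Returns one message per (item, broken rule) pair; empty means all rules hold.
function findViolations<T extends { id: string }>(items: T[], invariants: Invariant<T>[]): string[] {
  const violations: string[] = [];
  for (const item of items) {
    for (const invariant of invariants) {
      if (!invariant.holds(item)) violations.push(`${item.id}: ${invariant.description}`);
    }
  }
  return violations;
}
```

A seed run could call `findViolations` on each generated collection and fail loudly if the result is non-empty.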
Rollout Plan
- Ship as a development-only feature; no production impact
- The new seed script replaces the existing monolithic seed file
- Will be introduced into the develop branch incrementally, starting as a separate second seed function
- When finished, it will be swapped into yarn prisma:reset, and the old seed data will be removed
Out of Scope
- Multi-tenant seed data (nice-to-have, not required for V1)
- Automated seeding in CI/CD pipelines
- Seeding production or staging environments
- Using the seed infrastructure in the testing pipeline
Background / Context
The current seed data setup calls service functions directly in a single file, producing a small, static dataset. This has caused real issues:
- Frontend components have only been tested with ~4 objects, masking scroll/pagination bugs until production
- Backend query arg issues (over-fetching) haven't been caught early enough
- New features that aggregate data across multiple years and cars (currently in development) have no seed data to develop against
The new system needs to be distribution-aware: the goal is not just more data, but realistic data. Dates must be internally consistent, object states must be reachable through normal application flow, and numeric distributions (e.g., task completion ratios, BOM item counts) should reflect plausible real-world usage.
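As a rough illustration of "internally consistent dates" and "reachable states", the sketch below generates a work package whose dates always fall inside its project's window and whose status is derived from those dates instead of being picked independently. The `SeedWorkPackage` shape, the status names, the 1 to 8 week duration range, and the status rule are all assumptions for illustration; in the real application a work package past its end date may well still be active.

```typescript
// Illustrative sketch only: hypothetical shapes, not the real Prisma models.
type Status = 'INACTIVE' | 'ACTIVE' | 'COMPLETE';

interface SeedWorkPackage {
  startDate: Date;
  endDate: Date;
  status: Status;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;
const addDays = (date: Date, days: number): Date => new Date(date.getTime() + days * MS_PER_DAY);

// rng must return a float in [0, 1); the project window must span at least one day.
function generateWorkPackage(
  projectStart: Date,
  projectEnd: Date,
  now: Date,
  rng: () => number
): SeedWorkPackage {
  const projectDays = Math.floor((projectEnd.getTime() - projectStart.getTime()) / MS_PER_DAY);
  // Duration of 1 to 8 weeks (placeholder range), capped to fit the project window.
  const durationDays = Math.min(projectDays, 7 * (1 + Math.floor(rng() * 8)));
  // Choose a start offset that leaves room for the whole duration.
  const startOffset = Math.floor(rng() * (projectDays - durationDays + 1));
  const startDate = addDays(projectStart, startOffset);
  const endDate = addDays(startDate, durationDays);
  // Derive the status from the dates so the two can never contradict each other.
  let status: Status;
  if (endDate.getTime() < now.getTime()) status = 'COMPLETE';
  else if (startDate.getTime() <= now.getTime()) status = 'ACTIVE';
  else status = 'INACTIVE';
  return { startDate, endDate, status };
}

// Usage with a fixed stand-in for a seeded random source.
const exampleWorkPackage = generateWorkPackage(
  new Date('2024-09-01T00:00:00Z'),
  new Date('2025-05-01T00:00:00Z'),
  new Date('2025-01-01T00:00:00Z'),
  () => 0.5
);
```

Passing `now` in explicitly, instead of reading the system clock inside the generator, keeps the output reproducible for a given seed.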
Acceptance Criteria
Infrastructure
Cars & Users
Projects & Work Packages
Current Year Car
BOM Items & Reimbursement Requests
Date Consistency
Other objects
We want seed data for all objects. Numbers for the most relevant objects are provided here, but use your best judgment to make realistic data for other objects. Reach out to @wavehassman or @chpy04 when unsure about distributions.
Tickets