[Seed Data] - Seed data revamp #4199

@chpy04

High-Quality Procedural Seed Data for Development Environment

Overview

Introduce a procedural seed data generation system for FinishLine's development environment. The goal is to replace the existing single-file, manually defined seed data with a scalable, distribution-aware generation infrastructure that produces realistic, temporally consistent data across multiple cars and years. This will allow the team to stress-test features before production, surface bugs earlier (both frontend and backend), and support development of multi-year and multi-car aggregation features.

Stakeholders

Software Stakeholder: @chpy04 @wavehassman
Reference Users: @chpy04, FinishLine developers

User Story

As a FinishLine developer, I want a realistic, large-scale seed dataset that is procedurally generated with configurable distributions, so that I can catch UI, performance, and logic bugs in the development environment before they reach production, and build multi-year features against representative data.
As a FinishLine developer, I want a consistent infrastructure & established patterns for creating procedurally generated seed data with custom distributions, so that I can create high-quality and easily modifiable seed data for features I am developing.

Success Metrics

  • Seed data generation completes in a reasonable amount of time (<1 minute)
  • Developers no longer need to pull from the production database to stress-test features
  • Frontend scroll/pagination bugs are catchable in dev due to realistic data volumes
  • Backend query performance issues (e.g., over-fetching) are surfaced before production
  • Multi-year and multi-car aggregation features can be developed and tested entirely in dev
  • All generated data is in a valid application state — no impossible object configurations exist in the seed

Rollout Plan

  • Ship as a development-only feature; no production impact
  • The new seed script replaces the existing monolithic seed file
  • Will be introduced into develop incrementally with a separate second seed function to start
  • When finished, the new seed function will be swapped into yarn prisma:reset and the old seed data removed

Out of Scope

  • Multi-tenant seed data (nice-to-have, not required for V1)
  • Automated seeding in CI/CD pipelines
  • Seeding production or staging environments
  • Using seed infrastructure in testing pipeline

Background / Context

The current seed data setup calls service functions directly in a single file, producing a small, static dataset. This has caused real issues:

  • Frontend components have only been tested with ~4 objects, masking scroll/pagination bugs until production
  • Backend query arg issues (over-fetching) haven't been caught early enough
  • New features that aggregate data across multiple years and cars (currently in development) have no seed data to develop against

The new system needs to be distribution-aware: not just "more data," but realistic data. Dates must be internally consistent, object states must be reachable through normal application flow, and numeric distributions (e.g., task completion ratios, BOM item counts) should reflect plausible real-world usage.


Acceptance Criteria

Infrastructure

  • A fake-data library is integrated (e.g., Faker.js for TypeScript) that supports configurable value distributions
  • Each seedable domain object has a dedicated factory function that accepts optional overrides and returns a valid, self-consistent object
  • Names for prominent objects (e.g., projects and teams) are drawn randomly from static lists of strings to produce NER-like data
  • The seed script loops over factory functions rather than defining every object manually
  • If direct Prisma calls are used instead of service functions, all generated states must be reachable through normal service function flows (e.g., no unapproved stage gate change requests)
  • If service functions are used, concurrent call volume must not cause server instability during seeding
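The factory pattern described above could look roughly like the sketch below. Everything in it is illustrative: the `SeedProject` shape, the name lists, the default values, and the `makeProject` helper are hypothetical and do not reflect the actual Prisma schema. Random values come from an injectable `rng` function so that a seeded generator (for example, one provided by a fake-data library) can be swapped in for reproducible output.

```typescript
// Hypothetical object shape for illustration; the real model will differ.
interface SeedProject {
  name: string;
  carYear: number;
  budget: number;
  teamName: string;
}

type Rng = () => number; // returns a float in [0, 1)

// Placeholder static lists (assumed values, not real project or team names).
const PROJECT_NAMES = ["Battery Box", "Impact Attenuator", "Wiring Harness", "Pedal Box"];
const TEAM_NAMES = ["Mechanical", "Electrical", "Software", "Business"];

// Pick one element uniformly at random.
const pick = <T>(items: readonly T[], rng: Rng): T =>
  items[Math.floor(rng() * items.length)];

// Factory: generates a complete object, then applies caller overrides last.
function makeProject(overrides: Partial<SeedProject> = {}, rng: Rng = Math.random): SeedProject {
  return {
    name: pick(PROJECT_NAMES, rng),
    carYear: 2024, // assumed default
    budget: Math.round(500 + rng() * 4500), // assumed range
    teamName: pick(TEAM_NAMES, rng),
    ...overrides,
  };
}

// The seed script loops over the factory instead of hand-writing each object.
const projects = Array.from({ length: 30 }, () => makeProject({ carYear: 2023 }));
```

Because overrides are spread last, a caller can pin only the fields it cares about (here, `carYear`) and let the factory fill in the rest.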

Cars & Users

  • 4–5 cars are generated, each representing a distinct year
  • 300–400 total users are generated with the following role distribution (percentages of all users):
    • ~50% guests
    • ~50% non-guests, split as:
      • ~35% members
      • ~10% leads
      • ~4% heads
      • ~1% admins
  • Factory functions exist for each user pipeline stage: guest, onboarding, new member, member, lead, head/admin
  • 20 teams are generated and carry over across cars
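One way to realize the role distribution is weighted random sampling, sketched below. The role identifiers and the `sampleRole` helper are hypothetical names chosen for illustration, and the weights assume the listed percentages are fractions of all users.

```typescript
// Hypothetical role identifiers; the application's actual enum may differ.
type Role = "GUEST" | "MEMBER" | "LEAD" | "HEAD" | "ADMIN";

// Weights taken from the percentages listed above (they sum to 100).
const ROLE_WEIGHTS: ReadonlyArray<[Role, number]> = [
  ["GUEST", 50],
  ["MEMBER", 35],
  ["LEAD", 10],
  ["HEAD", 4],
  ["ADMIN", 1],
];

// Draw one role with probability proportional to its weight.
function sampleRole(rng: () => number = Math.random): Role {
  const total = ROLE_WEIGHTS.reduce((sum, [, weight]) => sum + weight, 0);
  let threshold = rng() * total;
  for (const [role, weight] of ROLE_WEIGHTS) {
    threshold -= weight;
    if (threshold < 0) return role;
  }
  return "ADMIN"; // only reached if rng() returns a value outside [0, 1)
}

// Example: tally the roles of 350 generated users.
const tally: Record<Role, number> = { GUEST: 0, MEMBER: 0, LEAD: 0, HEAD: 0, ADMIN: 0 };
for (let i = 0; i < 350; i++) tally[sampleRole()] += 1;
```

Sampling gives approximate proportions, which matches the "~" in the targets; if exact counts are needed, computing per-role counts up front and shuffling would be an alternative.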

Projects & Work Packages

  • Each car has ~30 projects
  • Each project has a randomly assigned duration between 3 and 12 months, positioned relative to its car's year (note that projects do not store dates themselves; the project timeline is the combined span of its work packages)
  • Each project has a variable number of work packages (0–8, average ~5), with date ranges that:
    • Collectively span the project's full timeline
    • Are sequentially ordered and non-overlapping where blocking relationships exist
  • A random subset of work packages is marked as blocking others; blocked work packages have start dates that respect the blocker's end date
  • Each project has 0–80 tasks, with assignments and due dates scoped to the project timeline
  • Task completion status is distributed by timeline position:
    • Tasks with past due dates → mostly Done
    • Tasks with future due dates → mostly Backlog
    • Tasks with due dates near the current date → mostly In Progress
  • Tasks can optionally be assigned to work packages, respecting work package ordering
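The work package and task rules above can be sketched as two small helpers. This is a minimal illustration under stated assumptions: `splitTimeline` and `sampleTaskStatus` are hypothetical names, the status labels are placeholders, and the probabilities (90% and 70%) and the 14-day "near" window are guesses rather than agreed values. The reference date is passed in explicitly so it can be tied to the car's year instead of the wall clock.

```typescript
interface DateRange { start: Date; end: Date; }
type TaskStatus = "DONE" | "IN_PROGRESS" | "BACKLOG"; // placeholder labels

const DAY_MS = 24 * 60 * 60 * 1000;

// Split a project timeline into `count` back-to-back, non-overlapping ranges
// of random length that together span the whole timeline.
function splitTimeline(project: DateRange, count: number, rng: () => number = Math.random): DateRange[] {
  if (count <= 0) return [];
  const weights = Array.from({ length: count }, () => 0.5 + rng());
  const total = weights.reduce((a, b) => a + b, 0);
  const span = project.end.getTime() - project.start.getTime();
  const ranges: DateRange[] = [];
  let cursor = project.start.getTime();
  weights.forEach((weight, i) => {
    // The last range always ends exactly at the project end.
    const end = i === count - 1
      ? project.end.getTime()
      : cursor + Math.round((weight / total) * span);
    ranges.push({ start: new Date(cursor), end: new Date(end) });
    cursor = end;
  });
  return ranges;
}

// Choose a task status based on where its due date falls relative to `now`.
function sampleTaskStatus(
  dueDate: Date,
  now: Date,
  rng: () => number = Math.random,
  nearWindowDays = 14, // assumed definition of "near the current date"
): TaskStatus {
  const daysUntilDue = (dueDate.getTime() - now.getTime()) / DAY_MS;
  const roll = rng();
  if (Math.abs(daysUntilDue) <= nearWindowDays) {
    return roll < 0.7 ? "IN_PROGRESS" : roll < 0.85 ? "DONE" : "BACKLOG";
  }
  if (daysUntilDue < 0) return roll < 0.9 ? "DONE" : "IN_PROGRESS";
  return roll < 0.9 ? "BACKLOG" : "IN_PROGRESS";
}
```

Since each range starts where the previous one ends, a blocked work package placed later in the list automatically starts no earlier than its blocker's end date.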

Current Year Car

  • The current-year car is modeled as being approximately halfway through its year
  • ~50% of BOM items are attributed to the past half of the year and marked as purchased
  • Tasks in the current-year car reflect the halfway-through state: past work packages are mostly complete or in review, and future ones are mostly in backlog/not started (with some random variation rather than a hard cutoff)

BOM Items & Reimbursement Requests

  • Each project has 0–200 BOM items, with most projects falling in the 0–30 range and a long tail of outliers
  • For past-year cars: ~60% of BOM items are tied to a reimbursement request
  • For the current-year car: ~30% of BOM items are tied to a reimbursement request
  • Reimbursement requests have a 70/30 split: BOM items vs. general supplies
  • Total budget per car targets ~$80K, distributed proportionally across its projects
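The long-tailed BOM item count and the proportional budget split could be generated along the following lines. This is a sketch, not a chosen design: the exponential distribution with a mean of 15 is an assumption that happens to keep most draws in the 0–30 range, and the weights passed to `allocateBudget` are a hypothetical input (for example, each project's BOM item count), since the basis for "proportionally" is not specified above.

```typescript
// Long-tailed BOM item count: exponential with the given mean, clamped to `max`.
function sampleBomCount(rng: () => number = Math.random, mean = 15, max = 200): number {
  const u = rng(); // in [0, 1)
  const value = -mean * Math.log(1 - u); // inverse CDF of the exponential distribution
  return Math.min(max, Math.floor(value));
}

// Split a car's total budget across projects in proportion to `weights`,
// in whole dollars, so that the shares add up to exactly `totalDollars`.
function allocateBudget(weights: number[], totalDollars = 80_000): number[] {
  const sum = weights.reduce((a, b) => a + b, 0);
  if (sum <= 0) return weights.map(() => 0);
  const shares = weights.map((w) => Math.floor((w / sum) * totalDollars));
  // Give the rounding remainder to the largest share.
  const remainder = totalDollars - shares.reduce((a, b) => a + b, 0);
  const largest = shares.indexOf(Math.max(...shares));
  shares[largest] += remainder;
  return shares;
}

// Example: ~30 projects for one car, budgeted by their BOM item counts.
const bomCounts = Array.from({ length: 30 }, () => sampleBomCount());
const budgets = allocateBudget(bomCounts);
```

With a mean of 15, roughly 86% of draws land at or below 30 items, and the clamp at 200 is rarely hit; a log-normal or Pareto distribution would give a heavier tail if more outliers are wanted.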

Date Consistency

  • All object dates (projects, work packages, tasks, BOM items, calendar events) are generated relative to their car's year, not the current wall-clock date
  • No object has a date that is inconsistent with its parent object's date range
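A small containment check can guard the parent/child date rule during generation. The sketch below assumes a car's year is a calendar year in UTC, which may not match how a car's season is actually defined; `carYearRange` and `isWithin` are hypothetical helper names.

```typescript
interface DateRange { start: Date; end: Date; }

// Assumed definition: a car's year is the UTC calendar year.
function carYearRange(year: number): DateRange {
  return {
    start: new Date(Date.UTC(year, 0, 1)),
    end: new Date(Date.UTC(year, 11, 31, 23, 59, 59)),
  };
}

// True when `child` is a well-formed range lying entirely inside `parent`.
function isWithin(child: DateRange, parent: DateRange): boolean {
  return (
    child.start.getTime() <= child.end.getTime() &&
    child.start.getTime() >= parent.start.getTime() &&
    child.end.getTime() <= parent.end.getTime()
  );
}

// Example: a work package checked against its car's year.
const car2023 = carYearRange(2023);
const workPackage: DateRange = {
  start: new Date(Date.UTC(2023, 2, 1)),
  end: new Date(Date.UTC(2023, 4, 1)),
};
const consistent = isWithin(workPackage, car2023);
```

Running a check like this over every generated child object before inserting it would surface inconsistent dates at seed time rather than in the UI.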

Other objects

We want seed data for all objects. Numbers for the most relevant objects are provided here; use your best judgment to generate realistic data for the other objects. Reach out to @wavehassman or @chpy04 when unsure about distributions.

Tickets

  • #
  • #

Metadata

Assignees

No one assigned

    Labels

    epic: A feature that will take many tickets to complete
