Architecture Report

# Architecture Review: FRC Data Science Project

## Review Scope

This review examines the repository as an evolving analytics platform rather than as a set of isolated scripts. The focus is structure, architecture, data contracts, operational reliability, and the path from exploratory work to a repeatable pipeline.

The review was based on direct inspection of the current source tree, including:

- Configuration and runtime wiring in [config/](config/)
- API ingestion in [api/](api/)
- Relational loaders and helpers in [sql/](sql/)
- Schema history in [db/schema.sql](db/schema.sql) and [db/migrations/](db/migrations/)
- Feature engineering and modeling in [data_science/](data_science/)
- Project-level guidance in [README.md](README.md) and [requirements.txt](requirements.txt)

This was a source and architecture review. It did not execute API pulls, migrations, or model training. Editor diagnostics were clean for the inspected Python folders, but several findings below are runtime-contract or data-correctness issues that static diagnostics do not catch.

## Executive Assessment

The repository has a promising core shape:

1. External FRC data is persisted as raw JSON before transformation.
2. JSON data is normalized into relational PostgreSQL tables.
3. Feature tables are derived into a dedicated `features` schema.
4. Modeling and EDA consume the feature layer rather than scraping raw API responses directly.

That shape is sound. It separates network acquisition, durable raw data, relational storage, feature engineering, and analysis. It is already more disciplined than a notebook-only data science project.

The main architectural problem is that the project is partway between "research scripts" and "operable data platform." The layers exist, but their contracts are not yet dependable:

| Area | Current Assessment | Architectural Consequence |
| --- | --- | --- |
| Configuration | Centralized in intent, inconsistent in implementation | Canonical scripts can fail before doing useful work |
| Schema management | Migrations exist, but generated/legacy SQL and refresh SQL disagree | Feature refreshes can fail against the current schema |
| Pipeline orchestration | Ordered scripts work as a manual convention | Recovery, repeatability, and partial reruns are fragile |
| Season handling | 2026 logic is embedded through ingestion, schema, and models | Historical and future seasons require code edits across layers |
| Feature correctness | Useful temporal ideas, but at least one rolling window likely leaks current-match data | Predictive measurements can be optimistic or misleading |
| Tests and reproducibility | README exists, but it has drifted from source and dependency capture is incomplete | New contributors and future reruns will be costly |

The highest-value next move is not a large rewrite. It is to stabilize the contracts that already exist: settings, schema authority, API client behavior, feature correctness, and a minimal pipeline command surface.

## Current System Shape

### Repository Responsibilities

| Subsystem | Primary Responsibility | Representative Files |
| --- | --- | --- |
| Configuration | Environment variables, directory locations, feature column lists | [config/database.py](config/database.py), [config/directories.py](config/directories.py), [config/data_science.py](config/data_science.py) |
| API URL construction | Build FIRST and TBA endpoints | [api/frc_and_tba_api_url_builders.py](api/frc_and_tba_api_url_builders.py) |
| API fetching and raw storage | Request JSON and save it under `api/json/` | [api/pull_and_save_api_url_data.py](api/pull_and_save_api_url_data.py), [api/get_folder_file_both_from_api_url.py](api/get_folder_file_both_from_api_url.py) |
| Pull orchestration | Batch districts, events, teams, and matches; write reference JSON | [api/scripts/](api/scripts/) |
| Relational loading | Convert raw JSON into DataFrames and append to PostgreSQL | [sql/scripts/load_files/](sql/scripts/load_files/), [sql/write_df_to_sql.py](sql/write_df_to_sql.py) |
| JSON parsing | Convert TBA match payloads to relational row structures | [sql/match_data_from_json.py](sql/match_data_from_json.py) |
| Schema authority | Define public, features, and predictions database structure | [db/schema.sql](db/schema.sql), [db/migrations/](db/migrations/) |
| SQL feature refresh | Populate filtered events and denormalized match feature tables | [sql/features/load_files/](sql/features/load_files/) |
| Feature engineering | Derive phase columns and rolling autonomous scoring statistics | [data_science/data_science_functions.py](data_science/data_science_functions.py), [data_science/feature_eng/](data_science/feature_eng/) |
| Analytics consumers | EDA dashboard and experimental model scripts | [data_science/eda/test_streamlit_server.py](data_science/eda/test_streamlit_server.py), [data_science/lightgbm/first_try_light_gbm.py](data_science/lightgbm/first_try_light_gbm.py), [data_science/test_opr.py](data_science/test_opr.py) |

### Current Data Flow

```mermaid
flowchart LR
    FRC[FIRST API] --> URLS[URL builders]
    TBA[The Blue Alliance API] --> URLS
    URLS --> PULL[Pull scripts and save helpers]
    PULL --> RAW[Raw JSON landing zone\napi/json]
    RAW --> REFS[Reference JSON\nkeys and page-derived indexes]
    RAW --> PYLOAD[Numbered Python loaders\nsql/scripts/load_files]
    REFS --> PULL
    PYLOAD --> CORE[(PostgreSQL public schema)]
    CORE --> SQLFEATURES[Ordered SQL feature jobs\nsql/features/load_files]
    SQLFEATURES --> FEATURES[(PostgreSQL features schema)]
    FEATURES --> ROLLING[Python rolling feature job]
    ROLLING --> FEATURES
    FEATURES --> EDA[Streamlit EDA]
    FEATURES --> MODELS[OPR, Lasso, LightGBM experiments]
```

The raw JSON handoff is a good decision. It allows ingestion and normalization to be debugged separately. It also provides a replay boundary so transformation bugs do not force repeated network calls.

The weaker part is orchestration. Many scripts execute work at import time rather than exposing reusable functions plus an explicit `main()` or command. The order is conveyed by filenames and README instructions, not by a testable pipeline abstraction.

### Simplified Relational Shape

```mermaid
erDiagram
    DISTRICTS ||--o{ EVENTS : groups
    EVENTS ||--o{ MATCHES : hosts
    MATCHES ||--|{ MATCH_DATA_2026 : scores
    MATCHES ||--o{ MATCH_TEAMS : includes
    TEAMS ||--o{ MATCH_TEAMS : participates
    FILTERED_EVENTS ||--o{ MATCHES_EDA : filters_context
    MATCHES_EDA ||--o{ MATCH_AUTO_DATA_CALC : supports_features

    DISTRICTS {
        varchar district_key PK
        varchar district_code
        varchar district_name
    }
    EVENTS {
        varchar event_key PK
        int year
        varchar district_key FK
    }
    MATCHES {
        varchar match_key PK
        varchar event_key FK
        timestamp actual_time
    }
    MATCH_DATA_2026 {
        varchar match_key PK
        varchar alliance PK
        int auto_points
        int total_score
    }
    MATCH_TEAMS {
        varchar match_key PK
        varchar team_key PK
        varchar alliance
    }
    TEAMS {
        varchar team_key PK
        int team_number
    }
    FILTERED_EVENTS {
        varchar event_key PK
        int year
        int week
    }
    MATCHES_EDA {
        varchar match_key PK
        varchar alliance PK
        varchar team_key1
        varchar team_key2
        varchar team_key3
        int auto_points
    }
    MATCH_AUTO_DATA_CALC {
        varchar team_key PK
        varchar match_key PK
        int prev_match_count
    }
```

The public schema is sensibly normalized for core entities. The features schema is intentionally denormalized for EDA and modeling. That is a reasonable warehouse pattern as long as the refresh path, schema history, and feature semantics remain synchronized.

## What the Codebase Already Does Well

### 1. It has a clear data lifecycle

The broad path from APIs to JSON to SQL to features to analysis is easy to explain. That matters. A new contributor can infer the purpose of most top-level directories without reading every script.

### 2. Raw JSON acts as an ingestion boundary

Files under `api/json/` are ignored in [`.gitignore`](.gitignore), which avoids committing large generated data. More importantly, the design acknowledges that external API responses are inputs worth retaining locally during development.

### 3. The relational design is more intentional than the script layout

[db/schema.sql](db/schema.sql) defines keys, foreign keys, and indexes. The split among `matches`, `match_data_2026`, and `match_teams` is useful: metadata, alliance-level scoring, and team participation are not flattened too early.

### 4. Feature tables are separated from source tables

Derived tables live in `features`, not `public`. This is a good governance instinct. It preserves the option to rebuild features from source facts and makes it easier to distinguish raw-ish facts from model-ready views.

### 5. Match payload parsing is centralized

[sql/match_data_from_json.py](sql/match_data_from_json.py) concentrates the 2026 match parsing logic. That is much easier to maintain than duplicating score extraction inside every loader or model script.

## Findings and Critique

### Finding 1: The runtime configuration contract is internally inconsistent

Severity: High

Evidence:

- [api/scripts/districts_tba_pull_save_create_reference.py](api/scripts/districts_tba_pull_save_create_reference.py), [api/scripts/events_tbd_pull_save_create_reference.py](api/scripts/events_tbd_pull_save_create_reference.py), and [api/scripts/matches_tba_pull_save_create_reference.py](api/scripts/matches_tba_pull_save_create_reference.py) import `START_YEAR` and `END_YEAR` from `config`, but the inspected config modules only expose `CURR_YEAR` in [config/general.py](config/general.py).
- Several SQL loaders import `SCHEMA` from `config`, including [sql/scripts/load_files/40_create_matches_data_df_json_sql.py](sql/scripts/load_files/40_create_matches_data_df_json_sql.py), but no config assignment for `SCHEMA` was found.
- [config/directories.py](config/directories.py) defines `REPO_ROOT = Path(__file__).resolve().parent`, which resolves to the `config` directory. It then builds paths such as `REPO_ROOT / 'api' / 'json'`, which points to `config/api/json` rather than the repository's `api/json` directory.
- [api/get_folder_file_both_from_api_url.py](api/get_folder_file_both_from_api_url.py) uses a different repository-root strategy by walking upward until `.git` is found.
- The environment-variable names documented in [README.md](README.md) do not match the implementation in [config/database.py](config/database.py) and [api/get_credentials.py](api/get_credentials.py). For example, code expects `POSTGRESQL_USERNAME`, `POSTGRESQL_PASSWORD`, `POSTGRESQL_DB_FRCDATASCIENCE`, `DATABASE_URL`, and `THE_BLUE_ALLIANCE_API_KEY`, while the README describes a different naming scheme.

Why it matters:

Configuration is a cross-cutting architectural contract. When it is inconsistent, every other subsystem appears unreliable even if its business logic is correct. The issue is not merely naming style; it affects path resolution, script importability, and setup instructions.

Recommendation:

Create one import-safe settings module with explicit fields for:

- Repository root and all generated-data directories
- Schema name
- Season range or selected season set
- Database DSN and explicit DB fields, with one documented source of truth
- FIRST and TBA credentials

Validate settings once at command startup and fail with precise messages. Update the README from that contract, not independently of it.

### Finding 2: Schema migrations and feature refresh SQL have drifted apart

Severity: High

Evidence:

- [db/migrations/20260514210821_remove_district_keys_predicted_time_from_matches_eda_and_add_opponent_auto_and_auto_won.sql](db/migrations/20260514210821_remove_district_keys_predicted_time_from_matches_eda_and_add_opponent_auto_and_auto_won.sql) drops `district_key` and `predicted_time` from `features.matches_eda`.
- The current dump in [db/schema.sql](db/schema.sql) reflects that migrated shape.
- [sql/features/load_files/20_add_matches_eda_data.sql](sql/features/load_files/20_add_matches_eda_data.sql) still inserts into `district_key` and `predicted_time`, which no longer exist in the table.
- [sql/SQL_Schema.sql](sql/SQL_Schema.sql) is an older generated schema artifact with a materially different shape from [db/schema.sql](db/schema.sql). It omits newer schema evolution and still represents older relationships.

Why it matters:

The feature refresh path is supposed to be a stable bridge from normalized facts to analysis tables. If it does not agree with the migration-backed schema, the pipeline can fail after doing ingestion and loading work successfully. Multiple schema authorities also increase the chance that future changes are made against the wrong definition.

Recommendation:

Treat migrations plus the generated [db/schema.sql](db/schema.sql) dump as the only schema authority. Either remove [sql/SQL_Schema.sql](sql/SQL_Schema.sql), clearly mark it as archival, or regenerate it from the same canonical source. Add an integration check that applies migrations to an empty database and runs all feature SQL files in order.

### Finding 3: Pipeline orchestration is convention-driven rather than command-driven

Severity: Medium-High

Evidence:

- [sql/scripts/load_all_table_data.py](sql/scripts/load_all_table_data.py) relies on lexicographically sorted filenames and subprocess calls to execute all Python and SQL steps.
- API scripts perform work at module import time rather than exposing side-effect-free functions plus explicit entrypoints.
- The pipeline mixes append-oriented relational loads, full feature-table truncation in [sql/features/load_files/10_copy_filtered_events.sql](sql/features/load_files/10_copy_filtered_events.sql) and [sql/features/load_files/20_add_matches_eda_data.sql](sql/features/load_files/20_add_matches_eda_data.sql), and append behavior for [data_science/feature_eng/auto_averages.py](data_science/feature_eng/auto_averages.py).
- Reference JSON files operate like manifests, but there is no explicit stage-status model or run metadata.

Why it matters:

Filename ordering is acceptable as a temporary migration pattern, but it becomes fragile once the project has partial reruns, failures halfway through a load, multiple seasons, or scheduled runs. The current design makes it hard to answer operational questions such as "Which stage completed?", "Can I rerun just features?", and "Is this table from the same raw-data snapshot as the model input?"

Recommendation:

Introduce a small CLI or task layer with stage-oriented commands, for example:

```text
frc-data pull districts --season 2026
frc-data pull events --season 2026
frc-data load core --season 2026
frc-data build features --season 2026
frc-data train auto-model --season 2026
```

Each command should validate settings, log inputs and outputs, and define its rerun behavior. Existing scripts can be refactored into callable functions first, then wired behind that CLI without a disruptive reorganization.

### Finding 4: The API layer lacks resilience and a single client abstraction

Severity: Medium-High

Evidence:

- [api/pull_and_save_api_url_data.py](api/pull_and_save_api_url_data.py) calls `requests.get()` directly in four code paths with no shared session, timeout, retry, or rate-limit handling.
- Directory creation is not centralized before file writes.
- Error handling mostly prints HTTP status and continues.
- The debug branch in `pull_and_save_teams_frc_api_data()` can reference response-derived variables after a file-skip path where those variables were not assigned.
- [api/get_folder_file_both_from_api_url.py](api/get_folder_file_both_from_api_url.py) performs URL-to-folder routing procedurally. The TBA loop uses `exit` where a loop break appears to be intended, and `call_folder` is not initialized before conditional use.

Why it matters:

API behavior is outside the repository's control. Transient 429s, 5xx responses, invalid credentials, and partial network failures are routine. A resilient client layer prevents many operational failures from leaking into every orchestration script.

Recommendation:

Create a thin `ApiClient` boundary for FIRST and TBA with:

- Session reuse
- Request timeout policy
- Bounded retries with backoff for retryable statuses
- Explicit auth validation errors
- Structured result objects or exceptions instead of print-only failure paths
- A raw-store helper that creates directories and performs atomic-ish write patterns where appropriate

Unit-test URL routing and file-name generation separately from network calls.

### Finding 5: Season-specific data shape is embedded across layers

Severity: Medium

Evidence:

- The schema uses `public.match_data_2026` in [db/schema.sql](db/schema.sql).
- [sql/match_data_from_json.py](sql/match_data_from_json.py) has a `matches_data_2026_from_json()` parser with 2026 game-specific fields.
- [sql/scripts/load_files/40_create_matches_data_df_json_sql.py](sql/scripts/load_files/40_create_matches_data_df_json_sql.py) globs only `*2026*` match files.
- [config/general.py](config/general.py) defines `CURR_YEAR = 2026`.
- Modeling scripts use a hard-coded split date of `2026-04-10` in [data_science/lightgbm/first_try_light_gbm.py](data_science/lightgbm/first_try_light_gbm.py) and [data_science/test_opr.py](data_science/test_opr.py).

Why it matters:

FRC scoring formats change by season, so some season specialization is appropriate. The architectural issue is not that 2026 logic exists; it is that season selection, parser choice, database naming, and experiment periods are scattered across unrelated layers. That makes the project harder to extend to 2027 or backfill older years without editing a wide swath of files.

Recommendation:

Separate season-invariant concepts from season-specific score adapters:

- Keep `matches`, `teams`, `events`, and `match_teams` stable.
- Represent game-specific score facts through versioned adapters and either season-specific tables behind a stable view or a normalized score-fact model with season metadata.
- Move training cutoffs and active seasons into explicit experiment or pipeline configuration.

### Finding 6: Rolling feature generation likely leaks current-match data

Severity: High for modeling correctness

Evidence:

- [data_science/feature_eng/auto_averages.py](data_science/feature_eng/auto_averages.py) appends the current row's `auto_points` to `team_auto_dict[current_team]` before calculating historical statistics.
- [data_science/feature_eng/function_for_feature_eng.py](data_science/feature_eng/function_for_feature_eng.py) calculates many window statistics from slices such as `scores[-i:]`. After the current score has been appended, those slices include the target match's value for common window sizes.
- The trend path does not follow the same "last N prior matches" window convention as the other statistics, making feature semantics harder to reason about.

Why it matters:

Temporal features intended for prediction must only use information available before the match being predicted. Leakage can make model evaluation look stronger than reality and conceal genuine generalization problems.

Recommendation:

Define temporal feature semantics precisely: each feature row for match `M` may use only matches with `actual_time < M.actual_time`, with deterministic tie handling. Compute features from the prior history first, then append the current result to history. Add small fixture tests where the correct mean, max, count, and trend can be verified by hand.

### Finding 7: The analytics layer is exploratory, but its boundaries are not labeled as such

Severity: Medium

Evidence:

- [data_science/lightgbm/first_try_light_gbm.py](data_science/lightgbm/first_try_light_gbm.py) trains a model directly at script load, mixes model fitting with Streamlit plotting, and imports metrics/helpers that are not used.
- [data_science/test_opr.py](data_science/test_opr.py) contains both a manual OPR workflow and a Lasso experiment. The Lasso section calls `model.fit(X_train, y_test)` rather than pairing training features with training labels, and its feature matrix still contains team-key strings rather than a validated numeric encoding.
- [data_science/eda/test_streamlit_server.py](data_science/eda/test_streamlit_server.py) is useful for investigation, but the README describes some behavior more strongly than the implementation currently provides.

Why it matters:

Exploratory scripts are normal in a data science repo. The architectural risk appears when experiments look like sanctioned pipeline stages but lack the protections expected of those stages. Readers may assume model metrics are reproducible, feature inputs are stable, or dashboard behavior matches documentation when those are still evolving.

Recommendation:

Make the distinction explicit:

- Keep prototypes in an `experiments` or clearly labeled research area.
- Promote stable workflows into parameterized modules with tests.
- Store experiment configuration, metrics, and split definitions together.
- Use the existing `predictions` schema only after prediction outputs have a documented contract.

### Finding 8: Reproducibility and documentation have drifted from code

Severity: Medium

Evidence:

- [requirements.txt](requirements.txt) includes ingestion, SQL, plotting, and Streamlit dependencies, but active Python files import packages not listed there: `lightgbm`, `scikit-learn`, `scipy`, `seaborn`, and `dython`.
- The README documents a legitimate end-to-end intent, but its configuration variable names do not match source code, its feature/model claims are ahead of implementation in places, and it does not mention the current schema/feature SQL mismatch.
- No conventional automated test setup, CI workflow, or project package metadata was found alongside the implementation. The files named `test_*.py` are currently exploratory scripts rather than assertion-driven tests.

Why it matters:

A data platform is reproducible only when code, dependencies, schema, and operating instructions agree. Drift here increases onboarding cost and makes future results harder to trust.

Recommendation:

At minimum:

- Split dependency capture into runtime, data-science, and development/test groups, or adopt a `pyproject.toml`-based toolchain.
- Update README configuration and run commands from a single verified source of truth.
- Add CI checks for importability, unit tests, migration application, and feature SQL execution against a small fixture database.

## Suggested Target Architecture

The current code does not need to abandon its directory structure immediately. A practical target is to introduce clearer boundaries while preserving the existing data flow.

```mermaid
flowchart LR
    SETTINGS[Validated settings]
    CLI[CLI or task entrypoints]
    API[Typed API clients\nFIRST and TBA]
    RAW[Raw store\nJSON plus optional run manifest]
    PARSERS[Season-aware parsers]
    CORE[(Public relational tables)]
    MIGRATIONS[Migrations and schema dump\none authority]
    FEATURES[Feature builders\nSQL and Python]
    FEATUREDB[(Features schema)]
    EXPERIMENTS[Experiments and notebooks]
    TRAIN[Stable training jobs]
    PRED[(Predictions schema)]
    TESTS[Fixture and integration tests]

    SETTINGS --> CLI
    CLI --> API
    API --> RAW
    RAW --> PARSERS
    PARSERS --> CORE
    MIGRATIONS --> CORE
    CORE --> FEATURES
    FEATURES --> FEATUREDB
    FEATUREDB --> EXPERIMENTS
    FEATUREDB --> TRAIN
    TRAIN --> PRED
    TESTS --> API
    TESTS --> PARSERS
    TESTS --> FEATURES
```

The target architecture introduces three ideas that the current repository is ready for:

1. Settings are validated once and then passed through the system, rather than rediscovered inconsistently.
2. Season specialization is an adapter concern, not a scattered global assumption.
3. Pipeline stages are commands with contracts, not only scripts that happen to run in a useful order.

## Recommended Improvement Roadmap

### Phase 1: Restore Contract Integrity

1. Fix and consolidate settings: repository root, `SCHEMA`, season range, credentials, and database DSN.
2. Synchronize README environment variable names with actual settings.
3. Repair `features.matches_eda` refresh SQL so it matches current migrations.
4. Retire, regenerate, or explicitly archive [sql/SQL_Schema.sql](sql/SQL_Schema.sql).
5. Add a tiny import/settings smoke test and a migration-plus-feature-SQL integration test.

### Phase 2: Harden Ingestion and Loads

1. Introduce FIRST and TBA client classes or functions with retries, timeouts, and consistent errors.
2. Centralize raw-store directory and filename behavior.
3. Refactor import-time scripts into functions plus command entrypoints.
4. Define idempotency rules for each stage and record run metadata where useful.
5. Replace ad hoc SQL identifier interpolation where external input could eventually reach query construction.

### Phase 3: Make the Feature Layer Trustworthy

1. Fix rolling feature windows so they are strictly historical.
2. Add fixture tests for temporal features and phase transformation.
3. Document feature definitions and table refresh semantics.
4. Decide whether feature tables are always rebuildable, incrementally maintained, or split by category.

### Phase 4: Mature the Analytics Surface

1. Separate exploratory scripts from stable model pipelines.
2. Parameterize active season, split date, target, and feature set.
3. Capture missing data-science dependencies in environment tooling.
4. Store stable prediction outputs only after the `predictions` schema contract is clear.

## Recommended Test Matrix

| Test Type | Purpose | Example Coverage |
| --- | --- | --- |
| Settings smoke test | Catch missing config exports and path mistakes | `REPO_ROOT`, `SCHEMA`, season bounds, env-name mapping |
| URL routing unit tests | Make raw-store filenames deterministic | TBA district/event/match/team URLs and FIRST URLs |
| Parser fixture tests | Lock down score extraction | 2026 match JSON to expected public-table rows |
| Migration integration test | Ensure schema builds from zero | Apply migrations into empty PostgreSQL |
| Feature SQL integration test | Prevent schema/SQL drift | Run ordered `sql/features/load_files` scripts |
| Temporal feature test | Prevent leakage | Prior-match-only rolling stats |
| Pipeline rerun test | Validate idempotency | Load same fixture twice without duplicate facts |
| Model smoke test | Separate experiment breakage from pipeline breakage | Tiny feature table produces metrics without type or shape errors |

## Suggested Package Evolution

This is a direction, not a required immediate rewrite. The repo could gradually converge toward a structure like:

```text
frc_data/
  settings.py
  cli.py
  api/
    clients.py
    routes.py
    raw_store.py
  ingest/
    districts.py
    events.py
    teams.py
    matches.py
  parsing/
    common.py
    season_2026.py
  db/
    connections.py
    repositories.py
  features/
    phase.py
    rolling_auto.py
experiments/
  opr.py
  lightgbm_auto.py
tests/
  fixtures/
```

The migration can be incremental: first extract stable functions from current scripts, then move only the functions that have proven reusable.

## Closing Judgment

The project already expresses a good platform idea: use raw snapshots to feed a relational warehouse, then build ML-ready features and analytical tools on top. The work worth doing next is chiefly contract work. Make configuration truthful, schema evolution authoritative, API calls resilient, temporal features trustworthy, and pipeline stages explicit. Once those foundations are in place, adding seasons, stronger models, and richer analytics will become much cheaper and much safer.# FRC Data Science Project — Architecture & Code Review

> A structured critique of the current codebase with diagrams, identified
> issues, and prioritized recommendations. Prepared as a working document
> for the maintainers; nothing in here changes runtime behavior.

---

## 1. Executive Summary

The project implements a **classic data-pipeline architecture**:
External APIs → JSON snapshots on disk → relational store (PostgreSQL) →
feature engineering → ML models + interactive EDA. The layering is
*directionally correct* and the use of `pathlib`, `SQLAlchemy`, and
`dbmate`-style migrations is solid.

However, the codebase shows symptoms of **organic growth without
periodic consolidation**:

- Several **import-time failures** (missing config constants).
- Significant **duplication** in the API and ETL layers.
- **No error handling, retries, or logging** anywhere in the pipeline.
- **Iterative pandas loops** where vectorized operations would be 50–100× faster.
- **Train/test contamination** in one of the ML scripts.
- **No automated tests**.

The good news: the directory layout already suggests the *correct*
abstractions. Most fixes are mechanical refactors, not redesigns.

---

## 2. High-Level Architecture

```mermaid
flowchart TD
    subgraph EXT["External Services"]
        FRC["FRC API v3"]
        TBA["The Blue Alliance API v3"]
    end

    subgraph API["api/"]
        URL["frc_and_tba_api_url_builders.py"]
        CRED["get_credentials.py"]
        PATH["get_folder_file_both_from_api_url.py"]
        PULL["pull_and_save_api_url_data.py"]
        SCRIPTS["scripts/*_pull_save_create_reference.py"]
    end

    subgraph DISK["api/json/ - raw JSON cache"]
        D1["districts_data/"]
        D2["events_data/"]
        D3["matches_data/"]
        D4["teams_data/"]
        D5["reference_data/"]
    end

    subgraph SQL["sql/ - ETL + DB access"]
        LOAD["scripts/load_files/10..40_*.py"]
        TRANSFORM["match_data_from_json.py"]
        DBACCESS["data_science_db.py / load_db_to_df.py"]
        FEATSQL["features/load_files/*.sql"]
    end

    subgraph PG["PostgreSQL"]
        PUB["public.* - raw"]
        FEAT["features.* - engineered"]
        PRED["predictions.* - empty"]
    end

    subgraph DS["data_science/"]
        FE["feature_eng/auto_averages.py"]
        OPR["test_opr.py"]
        LGBM["lightgbm/first_try_light_gbm.py"]
        EDA["eda/test_streamlit_server.py"]
    end

    FRC --> URL
    TBA --> URL
    URL --> PULL
    CRED --> PULL
    PATH --> PULL
    PULL --> DISK
    SCRIPTS --> PULL

    DISK --> LOAD
    TRANSFORM --> LOAD
    LOAD --> PUB
    PUB --> FEATSQL --> FEAT

    FEAT --> FE --> FEAT
    FEAT --> OPR
    FEAT --> LGBM
    FEAT --> EDA
    DBACCESS -.-> SQL
    DBACCESS -.-> DS
```

### Layer responsibilities

| Layer | Folder | Responsibility |
|---|---|---|
| Config | [config/](config/) | Paths, DB URL, feature lists, year range |
| Ingestion | [api/](api/) | Build URLs, auth, fetch, persist JSON |
| Storage (raw) | `api/json/` + [db/migrations/](db/migrations/) | JSON snapshots, schema versioning |
| ETL | [sql/scripts/load_files/](sql/scripts/load_files/) | JSON → DataFrame → PostgreSQL |
| Feature eng. | [data_science/feature_eng/](data_science/feature_eng/), [sql/features/](sql/features/) | Rolling stats, derived features |
| Modeling | [data_science/lightgbm/](data_science/lightgbm/), [data_science/test_opr.py](data_science/test_opr.py) | LightGBM regressor, OPR |
| Exploration | [data_science/eda/](data_science/eda/) | Streamlit dashboard |

---

## 3. Module-by-Module Critique

### 3.1 `config/`
Centralizes paths and DB connection. Good use of `pathlib` and `dotenv`.

**Issues**
- `START_YEAR`, `END_YEAR`, and `SCHEMA` are imported by [api/scripts/](api/scripts/) but **never defined** anywhere — these scripts cannot run as-is.
- DB host/port are hard-coded to `127.0.0.1:5432` despite `.env` being loaded for everything else.
- `IMPORTANT_FEATURES` is a long hand-edited Python list with no link to the DB schema; columns can silently drift.

### 3.2 `api/`
A reasonable URL-builder / fetcher split.

**Issues**
- [pull_and_save_api_url_data.py](api/pull_and_save_api_url_data.py) contains 3–4 near-identical `pull_and_save_*` functions — only headers/base URL differ.
- [get_folder_file_both_from_api_url.py](api/get_folder_file_both_from_api_url.py) reverse-engineers local file paths from URLs via regex + a walk to find `.git`. Brittle; a dict lookup of `{endpoint_kind: folder}` would replace it.
- **No `try/except` around `requests.get`**, no retry, no rate limit, no status-code check beyond the obvious. A single transient 5xx aborts the whole multi-year pull.
- `first_api_bad_before_2020/` is named as if it were dead code but is still in the import graph.

### 3.3 `sql/`
DB access layer is the strongest part of the codebase: a shared SQLAlchemy `ENGINE`, context-managed connections, and a sensible read helper.

**Issues**
- [match_data_from_json.py](sql/match_data_from_json.py) has a function literally named `matches_data_2026_from_json` — the season is baked into the function name, meaning next year requires a copy-paste. The same function also assigns `total_auto_points` twice with different values.
- [scripts/load_files/40_create_matches_data_df_json_sql.py](sql/scripts/load_files/40_create_matches_data_df_json_sql.py) globs only `*2026*` JSON files.
- Duplicate detection works by `SELECT key FROM table` into a Python `set` — this does not scale and ignores the obvious `INSERT ... ON CONFLICT DO NOTHING`.
- [write_df_to_sql.py](sql/write_df_to_sql.py) calls an interactive `confirm()` from inside the write function, blocking any batch / scheduled use.
- [get_district_from_state_country.py](sql/get_district_from_state_country.py) hard-codes a partial US-state → district map in Python; this is data and belongs in a table.

### 3.4 `db/`
`dbmate`-style timestamped migrations — good.

**Issues**
- 10 migrations in 3 days, several of which rename or remove columns added a day earlier — schema was clearly evolved at runtime rather than designed. Worth a consolidation pass before the next season.
- Filename `20260512174020_add feature and predictions schema.sql` contains a space — annoying for shell tools.

### 3.5 `data_science/`

**`feature_eng/auto_averages.py`**
- Uses an explicit Python loop over matches to compute rolling means / max / min / std / trend. `df.groupby('team_key').rolling(window=N)` replaces ~80% of this code and runs orders of magnitude faster.
- Import line `from function_for_feature_eng import ...` only works when CWD happens to be the right directory — should be `from data_science.feature_eng.function_for_feature_eng import ...`.
- Magic lookback windows `[3, 5, 10, 'season']` should live in `config/data_science.py`.

**`test_opr.py`**
- Sets up the OPR linear system correctly, but the Lasso section fits on `X_train` and then evaluates against `y_test` paired with predictions from `X_train` — **train/test leakage / mis-aligned shapes**. Numbers reported from this script are not trustworthy.
- No cross-validation; one fixed date split.

**`lightgbm/first_try_light_gbm.py`**
- Hyperparameters hand-picked; the test set is also passed as `eval_set` during training (early-stopping on the test set ⇒ optimistic test metrics).
- Categorical handling is implicit (LightGBM infers it). Make it explicit via `categorical_feature=[...]`.

**`eda/test_streamlit_server.py`**
- `@st.cache_data` with no TTL — dashboard shows stale data after DB writes until the Python process restarts.
- Unused `dython` import; large block of commented-out code.

### 3.6 `TestCode/` and root scripts
- `test.py` / `test_plot.ipynb` are ad-hoc scratch — not pytest tests. There is no automated test suite at all.
- `confirm.py` lives at repo root but is really a small utility; belongs in a `utils/` package.
- `requirements.txt` does not pin versions — reproducibility risk.
- The file `ystemctl list-units --type=service --state=running:q` at the repo root looks like an accidental commit from a botched terminal command. Delete it.

---

## 4. Cross-Cutting Concerns

```mermaid
flowchart LR
    subgraph Concerns
        C1[Configuration]
        C2[Error handling]
        C3[Logging]
        C4[Testing]
        C5[Transactions]
        C6[Reproducibility]
    end
    C1 -->|missing consts| BUG1[Import failures]
    C2 -->|absent| BUG2[Whole pipeline aborts on 1 bad response]
    C3 -->|print only| BUG3[Hard to diagnose batch runs]
    C4 -->|none| BUG4[Refactors are scary]
    C5 -->|if_exists=append| BUG5[Partial loads on crash]
    C6 -->|unpinned deps| BUG6[Works on my machine]
```

| Concern | State | Recommendation |
|---|---|---|
| Configuration | Partial — paths good, constants missing | Define `START_YEAR`, `END_YEAR`, `SCHEMA`; move all magic numbers in |
| Secrets | OK — `.env` + base64 | Add `.env.example`, document required vars |
| Error handling | Absent | Standard wrapper around `requests` with retry + status-code raise |
| Logging | `print` only | Adopt stdlib `logging`, one logger per module |
| Testing | None | `pytest` + fixtures for JSON samples; start with `match_data_from_json` |
| Transactions | Append-only, no rollback | Wrap each load in a transaction; use `ON CONFLICT DO NOTHING` |
| Naming | Mostly snake_case, some inconsistencies (`team_key1/2/3`, year baked into names) | Rename per-season helpers to take `year`; rename position-keyed columns to colour/role |

---

## 5. Code Smells — Ranked

| # | Severity | Location | Smell |
|---|---|---|---|
| 1 | 🔴 Critical | [api/scripts/*.py](api/scripts/) | Imports of undefined `START_YEAR`, `END_YEAR`, `SCHEMA` |
| 2 | 🔴 Critical | [data_science/test_opr.py](data_science/test_opr.py) | Train/test shape mismatch in Lasso block |
| 3 | 🔴 Critical | [data_science/feature_eng/auto_averages.py](data_science/feature_eng/auto_averages.py) | Broken relative import |
| 4 | 🟠 High | [api/pull_and_save_api_url_data.py](api/pull_and_save_api_url_data.py) | 3× near-duplicate fetch functions |
| 5 | 🟠 High | [api/get_folder_file_both_from_api_url.py](api/get_folder_file_both_from_api_url.py) | Regex + `.git`-walk to derive paths |
| 6 | 🟠 High | API layer | No retries / status-check / rate limiting |
| 7 | 🟠 High | [sql/get_district_from_state_country.py](sql/get_district_from_state_country.py) | Hard-coded partial state map |
| 8 | 🟡 Medium | [sql/match_data_from_json.py](sql/match_data_from_json.py) | Season hard-coded into function name; duplicate field assignment |
| 9 | 🟡 Medium | [sql/scripts/load_files/](sql/scripts/load_files/) | Python-set duplicate detection instead of SQL upserts |
| 10 | 🟡 Medium | [data_science/feature_eng/auto_averages.py](data_science/feature_eng/auto_averages.py) | Row-by-row pandas loops |
| 11 | 🟡 Medium | [data_science/lightgbm/first_try_light_gbm.py](data_science/lightgbm/first_try_light_gbm.py) | Early-stopping on test set |
| 12 | 🟡 Medium | [data_science/eda/test_streamlit_server.py](data_science/eda/test_streamlit_server.py) | Cache without TTL; dead imports |
| 13 | 🟡 Medium | [sql/write_df_to_sql.py](sql/write_df_to_sql.py) | Interactive prompt inside writer |
| 14 | 🟢 Low | [requirements.txt](requirements.txt) | Unpinned versions |
| 15 | 🟢 Low | repo root | Accidental file `ystemctl ... :q` |
| 16 | 🟢 Low | [api/first_api_bad_before_2020/](api/first_api_bad_before_2020/) | Dead/archive code in live tree |

---

## 6. Proposed Target Architecture

A relatively small rearrangement separates *concerns* the current layout already implies:

```mermaid
flowchart TD
    subgraph cfg["config/"]
        C["settings.py - pydantic-settings"]
    end
    subgraph ing["ingestion/ (was api/)"]
        CLI1["clients/frc.py"]
        CLI2["clients/tba.py"]
        BASE["clients/base.py - retry, auth, paging"]
        STORE["storage/json_cache.py"]
    end
    subgraph etl["etl/ (was sql/scripts/)"]
        PARSE["parsers/matches.py"]
        LOAD["loaders/postgres.py - upsert"]
    end
    subgraph db["db/"]
        MIG["migrations/"]
        REPO["repository/ - typed queries"]
    end
    subgraph feat["features/"]
        ROLL["rolling_stats.py - vectorized"]
        SQLFE["sql/*.sql"]
    end
    subgraph models["models/"]
        OPR2["opr.py"]
        GBM["lightgbm/"]
        TRAIN["train.py - CV, configs"]
    end
    subgraph apps["apps/"]
        DASH["dashboard/ - streamlit"]
    end
    subgraph tests["tests/"]
        UT["unit/"]
        IT["integration/"]
        FIX["fixtures/sample_*.json"]
    end

    C --> ing
    C --> etl
    C --> feat
    C --> models
    BASE --> CLI1
    BASE --> CLI2
    CLI1 --> STORE
    CLI2 --> STORE
    STORE --> PARSE
    PARSE --> LOAD
    LOAD --> REPO
    REPO --> feat
    feat --> models
    models --> apps
    tests -.-> ing
    tests -.-> etl
    tests -.-> feat
    tests -.-> models
```

Key differences from today:

1. One `BaseAPIClient` with retry/auth/paging; FRC and TBA subclass it. Replaces all `pull_and_save_*` duplicates.
2. JSON cache is a *storage adapter*, not coupled to URL regex parsing — the client tells the cache where to write.
3. ETL parsers are pure functions (JSON in, records out) → trivially unit-testable.
4. A small `repository/` module wraps SQL queries used by feature/model code, so call sites don't write SQL strings.
5. `models/train.py` is configuration-driven (YAML or pydantic) so OPR and LightGBM share a single train/eval loop with proper CV.
6. `tests/` mirrors source layout.

---

## 7. Prioritized Roadmap

### Tier 1 — Make it run correctly (hours)
1. Add `START_YEAR`, `END_YEAR`, `SCHEMA` to [config/general.py](config/general.py).
2. Fix the import path in [data_science/feature_eng/auto_averages.py](data_science/feature_eng/auto_averages.py).
3. Fix the OPR/Lasso train/test alignment in [data_science/test_opr.py](data_science/test_opr.py).
4. Delete the stray `ystemctl ...` file at repo root.
5. Pin `requirements.txt` (e.g. `pip freeze > requirements.lock.txt`).

### Tier 2 — Remove duplication & brittleness (days)
6. Collapse the `pull_and_save_*` family into a single client with auth strategies.
7. Replace regex path inference with explicit endpoint → folder mapping.
8. Parameterize `matches_data_2026_from_json` by year; pick parser from a `{year: parser}` registry.
9. Switch ETL inserts to `INSERT ... ON CONFLICT DO NOTHING` and drop the in-Python set check.
10. Move the state → district mapping into a seed table or a JSON in `config/`.

### Tier 3 — Robustness & observability (days)
11. Add `requests`-retry (e.g. `urllib3.Retry` adapter) with backoff and status-code checks.
12. Introduce `logging` (one module-level logger) and replace `print`.
13. Stand up `pytest` with fixtures for one JSON sample per entity; cover parsers first.
14. Wrap each loader in an explicit DB transaction; remove the interactive prompt from `write_df_to_postgres` (move it to the calling CLI).
15. Add `ttl=` to Streamlit caches.

### Tier 4 — Modeling discipline (days)
16. Vectorize rolling stats with `groupby().rolling()`.
17. Add a held-out validation split distinct from the test set; stop using test as `eval_set`.
18. Make hyperparameters and feature lists config-driven; record run metadata.
19. Add a baseline model (mean / per-team mean) to anchor metric comparisons.

### Tier 5 — Polish (days)
20. Move `confirm.py` into a `utils/` package; add type hints; run `ruff` + `mypy` in CI.
21. Write a real `README.md` (link to this doc; include `make`/`uv` commands).
22. Archive `api/first_api_bad_before_2020/` outside the import tree.

---

## 8. What's Already Good

- Clear *intent* of layering (ingestion / storage / features / models / apps).
- Migrations under version control with `dbmate`-style naming.
- Centralized DB engine with context managers.
- Separation of raw (`public`) vs engineered (`features`) vs serving (`predictions`) schemas — exactly the right shape.
- Two complementary modeling approaches (OPR linear, GBM nonlinear) — good for sanity-checking.
- Streamlit-based EDA next to the data, not in notebooks scattered on a laptop.

The bones are right. The work is mostly *tightening the joints* rather than rebuilding.


Area	Current Assessment	Architectural Consequence
Configuration	Centralized in intent, inconsistent in implementation	Canonical scripts can fail before doing useful work
Schema management	Migrations exist, but generated/legacy SQL and refresh SQL disagree	Feature refreshes can fail against the current schema
Pipeline orchestration	Ordered scripts work as a manual convention	Recovery, repeatability, and partial reruns are fragile
Season handling	2026 logic is embedded through ingestion, schema, and models	Historical and future seasons require code edits across layers
Feature correctness	Useful temporal ideas, but at least one rolling window likely leaks current-match data	Predictive measurements can be optimistic or misleading
Tests and reproducibility	README exists, but it has drifted from source and dependency capture is incomplete	New contributors and future reruns will be costly

Subsystem	Primary Responsibility	Representative Files
Configuration	Environment variables, directory locations, feature column lists	config/database.py, config/directories.py, config/data_science.py
API URL construction	Build FIRST and TBA endpoints	api/frc_and_tba_api_url_builders.py
API fetching and raw storage	Request JSON and save it under `api/json/`	api/pull_and_save_api_url_data.py, api/get_folder_file_both_from_api_url.py
Pull orchestration	Batch districts, events, teams, and matches; write reference JSON	api/scripts/
Relational loading	Convert raw JSON into DataFrames and append to PostgreSQL	sql/scripts/load_files/, sql/write_df_to_sql.py
JSON parsing	Convert TBA match payloads to relational row structures	sql/match_data_from_json.py
Schema authority	Define public, features, and predictions database structure	db/schema.sql, db/migrations/
SQL feature refresh	Populate filtered events and denormalized match feature tables	sql/features/load_files/
Feature engineering	Derive phase columns and rolling autonomous scoring statistics	data_science/data_science_functions.py, data_science/feature_eng/
Analytics consumers	EDA dashboard and experimental model scripts	data_science/eda/test_streamlit_server.py, data_science/lightgbm/first_try_light_gbm.py, data_science/test_opr.py

Test Type	Purpose	Example Coverage
Settings smoke test	Catch missing config exports and path mistakes	`REPO_ROOT`, `SCHEMA`, season bounds, env-name mapping
URL routing unit tests	Make raw-store filenames deterministic	TBA district/event/match/team URLs and FIRST URLs
Parser fixture tests	Lock down score extraction	2026 match JSON to expected public-table rows
Migration integration test	Ensure schema builds from zero	Apply migrations into empty PostgreSQL
Feature SQL integration test	Prevent schema/SQL drift	Run ordered `sql/features/load_files` scripts
Temporal feature test	Prevent leakage	Prior-match-only rolling stats
Pipeline rerun test	Validate idempotency	Load same fixture twice without duplicate facts
Model smoke test	Separate experiment breakage from pipeline breakage	Tiny feature table produces metrics without type or shape errors

Layer	Folder	Responsibility
Config	config/	Paths, DB URL, feature lists, year range
Ingestion	api/	Build URLs, auth, fetch, persist JSON
Storage (raw)	`api/json/` + db/migrations/	JSON snapshots, schema versioning
ETL	sql/scripts/load_files/	JSON → DataFrame → PostgreSQL
Feature eng.	data_science/feature_eng/, sql/features/	Rolling stats, derived features
Modeling	data_science/lightgbm/, data_science/test_opr.py	LightGBM regressor, OPR
Exploration	data_science/eda/	Streamlit dashboard

Concern	State	Recommendation
Configuration	Partial — paths good, constants missing	Define `START_YEAR`, `END_YEAR`, `SCHEMA`; move all magic numbers in
Secrets	OK — `.env` + base64	Add `.env.example`, document required vars
Error handling	Absent	Standard wrapper around `requests` with retry + status-code raise
Logging	`print` only	Adopt stdlib `logging`, one logger per module
Testing	None	`pytest` + fixtures for JSON samples; start with `match_data_from_json`
Transactions	Append-only, no rollback	Wrap each load in a transaction; use `ON CONFLICT DO NOTHING`
Naming	Mostly snake_case, some inconsistencies (`team_key1/2/3`, year baked into names)	Rename per-season helpers to take `year`; rename position-keyed columns to colour/role

#	Severity	Location	Smell
1	🔴 Critical	api/scripts/*.py	Imports of undefined `START_YEAR`, `END_YEAR`, `SCHEMA`
2	🔴 Critical	data_science/test_opr.py	Train/test shape mismatch in Lasso block
3	🔴 Critical	data_science/feature_eng/auto_averages.py	Broken relative import
4	🟠 High	api/pull_and_save_api_url_data.py	3× near-duplicate fetch functions
5	🟠 High	api/get_folder_file_both_from_api_url.py	Regex + `.git`-walk to derive paths
6	🟠 High	API layer	No retries / status-check / rate limiting
7	🟠 High	sql/get_district_from_state_country.py	Hard-coded partial state map
8	🟡 Medium	sql/match_data_from_json.py	Season hard-coded into function name; duplicate field assignment
9	🟡 Medium	sql/scripts/load_files/	Python-set duplicate detection instead of SQL upserts
10	🟡 Medium	data_science/feature_eng/auto_averages.py	Row-by-row pandas loops
11	🟡 Medium	data_science/lightgbm/first_try_light_gbm.py	Early-stopping on test set
12	🟡 Medium	data_science/eda/test_streamlit_server.py	Cache without TTL; dead imports
13	🟡 Medium	sql/write_df_to_sql.py	Interactive prompt inside writer
14	🟢 Low	requirements.txt	Unpinned versions
15	🟢 Low	repo root	Accidental file `ystemctl ... :q`
16	🟢 Low	api/first_api_bad_before_2020/	Dead/archive code in live tree

Architecture Report #1

Description

Architecture Review: FRC Data Science Project

Review Scope

Executive Assessment

Current System Shape

Repository Responsibilities

Current Data Flow

Simplified Relational Shape

What the Codebase Already Does Well

1. It has a clear data lifecycle

2. Raw JSON acts as an ingestion boundary

3. The relational design is more intentional than the script layout

4. Feature tables are separated from source tables

5. Match payload parsing is centralized

Findings and Critique

Finding 1: The runtime configuration contract is internally inconsistent

Finding 2: Schema migrations and feature refresh SQL have drifted apart

Finding 3: Pipeline orchestration is convention-driven rather than command-driven

Finding 4: The API layer lacks resilience and a single client abstraction

Finding 5: Season-specific data shape is embedded across layers

Finding 6: Rolling feature generation likely leaks current-match data

Finding 7: The analytics layer is exploratory, but its boundaries are not labeled as such

Finding 8: Reproducibility and documentation have drifted from code

Suggested Target Architecture

Recommended Improvement Roadmap

Phase 1: Restore Contract Integrity

Phase 2: Harden Ingestion and Loads

Phase 3: Make the Feature Layer Trustworthy

Phase 4: Mature the Analytics Surface

Recommended Test Matrix

Suggested Package Evolution

Closing Judgment

1. Executive Summary

2. High-Level Architecture

Layer responsibilities

3. Module-by-Module Critique

3.1 config/

3.2 api/

3.3 sql/

3.4 db/

3.5 data_science/

3.6 TestCode/ and root scripts

4. Cross-Cutting Concerns

5. Code Smells — Ranked

6. Proposed Target Architecture

7. Prioritized Roadmap

Tier 1 — Make it run correctly (hours)

Tier 2 — Remove duplication & brittleness (days)

Tier 3 — Robustness & observability (days)

Tier 4 — Modeling discipline (days)

Tier 5 — Polish (days)

8. What's Already Good

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

3.1 `config/`

3.2 `api/`

3.3 `sql/`

3.4 `db/`

3.5 `data_science/`

3.6 `TestCode/` and root scripts