Skip to content

Architecture Report #1

@vritant24

Description

@vritant24

Architecture Review: FRC Data Science Project

Review Scope

This review examines the repository as an evolving analytics platform rather than as a set of isolated scripts. The focus is structure, architecture, data contracts, operational reliability, and the path from exploratory work to a repeatable pipeline.

The review was based on direct inspection of the current source tree, including:

This was a source and architecture review. It did not execute API pulls, migrations, or model training. Editor diagnostics were clean for the inspected Python folders, but several findings below are runtime-contract or data-correctness issues that static diagnostics do not catch.

Executive Assessment

The repository has a promising core shape:

  1. External FRC data is persisted as raw JSON before transformation.
  2. JSON data is normalized into relational PostgreSQL tables.
  3. Feature tables are derived into a dedicated features schema.
  4. Modeling and EDA consume the feature layer rather than scraping raw API responses directly.

That shape is sound. It separates network acquisition, durable raw data, relational storage, feature engineering, and analysis. It is already more disciplined than a notebook-only data science project.

The main architectural problem is that the project is partway between "research scripts" and "operable data platform." The layers exist, but their contracts are not yet dependable:

Area Current Assessment Architectural Consequence
Configuration Centralized in intent, inconsistent in implementation Canonical scripts can fail before doing useful work
Schema management Migrations exist, but generated/legacy SQL and refresh SQL disagree Feature refreshes can fail against the current schema
Pipeline orchestration Ordered scripts work as a manual convention Recovery, repeatability, and partial reruns are fragile
Season handling 2026 logic is embedded through ingestion, schema, and models Historical and future seasons require code edits across layers
Feature correctness Useful temporal ideas, but at least one rolling window likely leaks current-match data Predictive measurements can be optimistic or misleading
Tests and reproducibility README exists, but it has drifted from source and dependency capture is incomplete New contributors and future reruns will be costly

The highest-value next move is not a large rewrite. It is to stabilize the contracts that already exist: settings, schema authority, API client behavior, feature correctness, and a minimal pipeline command surface.

Current System Shape

Repository Responsibilities

Subsystem Primary Responsibility Representative Files
Configuration Environment variables, directory locations, feature column lists config/database.py, config/directories.py, config/data_science.py
API URL construction Build FIRST and TBA endpoints api/frc_and_tba_api_url_builders.py
API fetching and raw storage Request JSON and save it under api/json/ api/pull_and_save_api_url_data.py, api/get_folder_file_both_from_api_url.py
Pull orchestration Batch districts, events, teams, and matches; write reference JSON api/scripts/
Relational loading Convert raw JSON into DataFrames and append to PostgreSQL sql/scripts/load_files/, sql/write_df_to_sql.py
JSON parsing Convert TBA match payloads to relational row structures sql/match_data_from_json.py
Schema authority Define public, features, and predictions database structure db/schema.sql, db/migrations/
SQL feature refresh Populate filtered events and denormalized match feature tables sql/features/load_files/
Feature engineering Derive phase columns and rolling autonomous scoring statistics data_science/data_science_functions.py, data_science/feature_eng/
Analytics consumers EDA dashboard and experimental model scripts data_science/eda/test_streamlit_server.py, data_science/lightgbm/first_try_light_gbm.py, data_science/test_opr.py

Current Data Flow

flowchart LR
    FRC[FIRST API] --> URLS[URL builders]
    TBA[The Blue Alliance API] --> URLS
    URLS --> PULL[Pull scripts and save helpers]
    PULL --> RAW[Raw JSON landing zone\napi/json]
    RAW --> REFS[Reference JSON\nkeys and page-derived indexes]
    RAW --> PYLOAD[Numbered Python loaders\nsql/scripts/load_files]
    REFS --> PULL
    PYLOAD --> CORE[(PostgreSQL public schema)]
    CORE --> SQLFEATURES[Ordered SQL feature jobs\nsql/features/load_files]
    SQLFEATURES --> FEATURES[(PostgreSQL features schema)]
    FEATURES --> ROLLING[Python rolling feature job]
    ROLLING --> FEATURES
    FEATURES --> EDA[Streamlit EDA]
    FEATURES --> MODELS[OPR, Lasso, LightGBM experiments]
Loading

The raw JSON handoff is a good decision. It allows ingestion and normalization to be debugged separately. It also provides a replay boundary so transformation bugs do not force repeated network calls.

The weaker part is orchestration. Many scripts execute work at import time rather than exposing reusable functions plus an explicit main() or command. The order is conveyed by filenames and README instructions, not by a testable pipeline abstraction.

Simplified Relational Shape

erDiagram
    DISTRICTS ||--o{ EVENTS : groups
    EVENTS ||--o{ MATCHES : hosts
    MATCHES ||--|{ MATCH_DATA_2026 : scores
    MATCHES ||--o{ MATCH_TEAMS : includes
    TEAMS ||--o{ MATCH_TEAMS : participates
    FILTERED_EVENTS ||--o{ MATCHES_EDA : filters_context
    MATCHES_EDA ||--o{ MATCH_AUTO_DATA_CALC : supports_features

    DISTRICTS {
        varchar district_key PK
        varchar district_code
        varchar district_name
    }
    EVENTS {
        varchar event_key PK
        int year
        varchar district_key FK
    }
    MATCHES {
        varchar match_key PK
        varchar event_key FK
        timestamp actual_time
    }
    MATCH_DATA_2026 {
        varchar match_key PK
        varchar alliance PK
        int auto_points
        int total_score
    }
    MATCH_TEAMS {
        varchar match_key PK
        varchar team_key PK
        varchar alliance
    }
    TEAMS {
        varchar team_key PK
        int team_number
    }
    FILTERED_EVENTS {
        varchar event_key PK
        int year
        int week
    }
    MATCHES_EDA {
        varchar match_key PK
        varchar alliance PK
        varchar team_key1
        varchar team_key2
        varchar team_key3
        int auto_points
    }
    MATCH_AUTO_DATA_CALC {
        varchar team_key PK
        varchar match_key PK
        int prev_match_count
    }
Loading

The public schema is sensibly normalized for core entities. The features schema is intentionally denormalized for EDA and modeling. That is a reasonable warehouse pattern as long as the refresh path, schema history, and feature semantics remain synchronized.

What the Codebase Already Does Well

1. It has a clear data lifecycle

The broad path from APIs to JSON to SQL to features to analysis is easy to explain. That matters. A new contributor can infer the purpose of most top-level directories without reading every script.

2. Raw JSON acts as an ingestion boundary

Files under api/json/ are ignored in .gitignore, which avoids committing large generated data. More importantly, the design acknowledges that external API responses are inputs worth retaining locally during development.

3. The relational design is more intentional than the script layout

db/schema.sql defines keys, foreign keys, and indexes. The split among matches, match_data_2026, and match_teams is useful: metadata, alliance-level scoring, and team participation are not flattened too early.

4. Feature tables are separated from source tables

Derived tables live in features, not public. This is a good governance instinct. It preserves the option to rebuild features from source facts and makes it easier to distinguish raw-ish facts from model-ready views.

5. Match payload parsing is centralized

sql/match_data_from_json.py concentrates the 2026 match parsing logic. That is much easier to maintain than duplicating score extraction inside every loader or model script.

Findings and Critique

Finding 1: The runtime configuration contract is internally inconsistent

Severity: High

Evidence:

Why it matters:

Configuration is a cross-cutting architectural contract. When it is inconsistent, every other subsystem appears unreliable even if its business logic is correct. The issue is not merely naming style; it affects path resolution, script importability, and setup instructions.

Recommendation:

Create one import-safe settings module with explicit fields for:

  • Repository root and all generated-data directories
  • Schema name
  • Season range or selected season set
  • Database DSN and explicit DB fields, with one documented source of truth
  • FIRST and TBA credentials

Validate settings once at command startup and fail with precise messages. Update the README from that contract, not independently of it.

Finding 2: Schema migrations and feature refresh SQL have drifted apart

Severity: High

Evidence:

Why it matters:

The feature refresh path is supposed to be a stable bridge from normalized facts to analysis tables. If it does not agree with the migration-backed schema, the pipeline can fail after doing ingestion and loading work successfully. Multiple schema authorities also increase the chance that future changes are made against the wrong definition.

Recommendation:

Treat migrations plus the generated db/schema.sql dump as the only schema authority. Either remove sql/SQL_Schema.sql, clearly mark it as archival, or regenerate it from the same canonical source. Add an integration check that applies migrations to an empty database and runs all feature SQL files in order.

Finding 3: Pipeline orchestration is convention-driven rather than command-driven

Severity: Medium-High

Evidence:

Why it matters:

Filename ordering is acceptable as a temporary migration pattern, but it becomes fragile once the project has partial reruns, failures halfway through a load, multiple seasons, or scheduled runs. The current design makes it hard to answer operational questions such as "Which stage completed?", "Can I rerun just features?", and "Is this table from the same raw-data snapshot as the model input?"

Recommendation:

Introduce a small CLI or task layer with stage-oriented commands, for example:

frc-data pull districts --season 2026
frc-data pull events --season 2026
frc-data load core --season 2026
frc-data build features --season 2026
frc-data train auto-model --season 2026

Each command should validate settings, log inputs and outputs, and define its rerun behavior. Existing scripts can be refactored into callable functions first, then wired behind that CLI without a disruptive reorganization.

Finding 4: The API layer lacks resilience and a single client abstraction

Severity: Medium-High

Evidence:

  • api/pull_and_save_api_url_data.py calls requests.get() directly in four code paths with no shared session, timeout, retry, or rate-limit handling.
  • Directory creation is not centralized before file writes.
  • Error handling mostly prints HTTP status and continues.
  • The debug branch in pull_and_save_teams_frc_api_data() can reference response-derived variables after a file-skip path where those variables were not assigned.
  • api/get_folder_file_both_from_api_url.py performs URL-to-folder routing procedurally. The TBA loop uses exit where a loop break appears to be intended, and call_folder is not initialized before conditional use.

Why it matters:

API behavior is outside the repository's control. Transient 429s, 5xx responses, invalid credentials, and partial network failures are routine. A resilient client layer prevents many operational failures from leaking into every orchestration script.

Recommendation:

Create a thin ApiClient boundary for FIRST and TBA with:

  • Session reuse
  • Request timeout policy
  • Bounded retries with backoff for retryable statuses
  • Explicit auth validation errors
  • Structured result objects or exceptions instead of print-only failure paths
  • A raw-store helper that creates directories and performs atomic-ish write patterns where appropriate

Unit-test URL routing and file-name generation separately from network calls.

Finding 5: Season-specific data shape is embedded across layers

Severity: Medium

Evidence:

Why it matters:

FRC scoring formats change by season, so some season specialization is appropriate. The architectural issue is not that 2026 logic exists; it is that season selection, parser choice, database naming, and experiment periods are scattered across unrelated layers. That makes the project harder to extend to 2027 or backfill older years without editing a wide swath of files.

Recommendation:

Separate season-invariant concepts from season-specific score adapters:

  • Keep matches, teams, events, and match_teams stable.
  • Represent game-specific score facts through versioned adapters and either season-specific tables behind a stable view or a normalized score-fact model with season metadata.
  • Move training cutoffs and active seasons into explicit experiment or pipeline configuration.

Finding 6: Rolling feature generation likely leaks current-match data

Severity: High for modeling correctness

Evidence:

  • data_science/feature_eng/auto_averages.py appends the current row's auto_points to team_auto_dict[current_team] before calculating historical statistics.
  • data_science/feature_eng/function_for_feature_eng.py calculates many window statistics from slices such as scores[-i:]. After the current score has been appended, those slices include the target match's value for common window sizes.
  • The trend path does not follow the same "last N prior matches" window convention as the other statistics, making feature semantics harder to reason about.

Why it matters:

Temporal features intended for prediction must only use information available before the match being predicted. Leakage can make model evaluation look stronger than reality and conceal genuine generalization problems.

Recommendation:

Define temporal feature semantics precisely: each feature row for match M may use only matches with actual_time < M.actual_time, with deterministic tie handling. Compute features from the prior history first, then append the current result to history. Add small fixture tests where the correct mean, max, count, and trend can be verified by hand.

Finding 7: The analytics layer is exploratory, but its boundaries are not labeled as such

Severity: Medium

Evidence:

  • data_science/lightgbm/first_try_light_gbm.py trains a model directly at script load, mixes model fitting with Streamlit plotting, and imports metrics/helpers that are not used.
  • data_science/test_opr.py contains both a manual OPR workflow and a Lasso experiment. The Lasso section calls model.fit(X_train, y_test) rather than pairing training features with training labels, and its feature matrix still contains team-key strings rather than a validated numeric encoding.
  • data_science/eda/test_streamlit_server.py is useful for investigation, but the README describes some behavior more strongly than the implementation currently provides.

Why it matters:

Exploratory scripts are normal in a data science repo. The architectural risk appears when experiments look like sanctioned pipeline stages but lack the protections expected of those stages. Readers may assume model metrics are reproducible, feature inputs are stable, or dashboard behavior matches documentation when those are still evolving.

Recommendation:

Make the distinction explicit:

  • Keep prototypes in an experiments or clearly labeled research area.
  • Promote stable workflows into parameterized modules with tests.
  • Store experiment configuration, metrics, and split definitions together.
  • Use the existing predictions schema only after prediction outputs have a documented contract.

Finding 8: Reproducibility and documentation have drifted from code

Severity: Medium

Evidence:

  • requirements.txt includes ingestion, SQL, plotting, and Streamlit dependencies, but active Python files import packages not listed there: lightgbm, scikit-learn, scipy, seaborn, and dython.
  • The README documents a legitimate end-to-end intent, but its configuration variable names do not match source code, its feature/model claims are ahead of implementation in places, and it does not mention the current schema/feature SQL mismatch.
  • No conventional automated test setup, CI workflow, or project package metadata was found alongside the implementation. The files named test_*.py are currently exploratory scripts rather than assertion-driven tests.

Why it matters:

A data platform is reproducible only when code, dependencies, schema, and operating instructions agree. Drift here increases onboarding cost and makes future results harder to trust.

Recommendation:

At minimum:

  • Split dependency capture into runtime, data-science, and development/test groups, or adopt a pyproject.toml-based toolchain.
  • Update README configuration and run commands from a single verified source of truth.
  • Add CI checks for importability, unit tests, migration application, and feature SQL execution against a small fixture database.

Suggested Target Architecture

The current code does not need to abandon its directory structure immediately. A practical target is to introduce clearer boundaries while preserving the existing data flow.

flowchart LR
    SETTINGS[Validated settings]
    CLI[CLI or task entrypoints]
    API[Typed API clients\nFIRST and TBA]
    RAW[Raw store\nJSON plus optional run manifest]
    PARSERS[Season-aware parsers]
    CORE[(Public relational tables)]
    MIGRATIONS[Migrations and schema dump\none authority]
    FEATURES[Feature builders\nSQL and Python]
    FEATUREDB[(Features schema)]
    EXPERIMENTS[Experiments and notebooks]
    TRAIN[Stable training jobs]
    PRED[(Predictions schema)]
    TESTS[Fixture and integration tests]

    SETTINGS --> CLI
    CLI --> API
    API --> RAW
    RAW --> PARSERS
    PARSERS --> CORE
    MIGRATIONS --> CORE
    CORE --> FEATURES
    FEATURES --> FEATUREDB
    FEATUREDB --> EXPERIMENTS
    FEATUREDB --> TRAIN
    TRAIN --> PRED
    TESTS --> API
    TESTS --> PARSERS
    TESTS --> FEATURES
Loading

The target architecture introduces three ideas that the current repository is ready for:

  1. Settings are validated once and then passed through the system, rather than rediscovered inconsistently.
  2. Season specialization is an adapter concern, not a scattered global assumption.
  3. Pipeline stages are commands with contracts, not only scripts that happen to run in a useful order.

Recommended Improvement Roadmap

Phase 1: Restore Contract Integrity

  1. Fix and consolidate settings: repository root, SCHEMA, season range, credentials, and database DSN.
  2. Synchronize README environment variable names with actual settings.
  3. Repair features.matches_eda refresh SQL so it matches current migrations.
  4. Retire, regenerate, or explicitly archive sql/SQL_Schema.sql.
  5. Add a tiny import/settings smoke test and a migration-plus-feature-SQL integration test.

Phase 2: Harden Ingestion and Loads

  1. Introduce FIRST and TBA client classes or functions with retries, timeouts, and consistent errors.
  2. Centralize raw-store directory and filename behavior.
  3. Refactor import-time scripts into functions plus command entrypoints.
  4. Define idempotency rules for each stage and record run metadata where useful.
  5. Replace ad hoc SQL identifier interpolation where external input could eventually reach query construction.

Phase 3: Make the Feature Layer Trustworthy

  1. Fix rolling feature windows so they are strictly historical.
  2. Add fixture tests for temporal features and phase transformation.
  3. Document feature definitions and table refresh semantics.
  4. Decide whether feature tables are always rebuildable, incrementally maintained, or split by category.

Phase 4: Mature the Analytics Surface

  1. Separate exploratory scripts from stable model pipelines.
  2. Parameterize active season, split date, target, and feature set.
  3. Capture missing data-science dependencies in environment tooling.
  4. Store stable prediction outputs only after the predictions schema contract is clear.

Recommended Test Matrix

Test Type Purpose Example Coverage
Settings smoke test Catch missing config exports and path mistakes REPO_ROOT, SCHEMA, season bounds, env-name mapping
URL routing unit tests Make raw-store filenames deterministic TBA district/event/match/team URLs and FIRST URLs
Parser fixture tests Lock down score extraction 2026 match JSON to expected public-table rows
Migration integration test Ensure schema builds from zero Apply migrations into empty PostgreSQL
Feature SQL integration test Prevent schema/SQL drift Run ordered sql/features/load_files scripts
Temporal feature test Prevent leakage Prior-match-only rolling stats
Pipeline rerun test Validate idempotency Load same fixture twice without duplicate facts
Model smoke test Separate experiment breakage from pipeline breakage Tiny feature table produces metrics without type or shape errors

Suggested Package Evolution

This is a direction, not a required immediate rewrite. The repo could gradually converge toward a structure like:

frc_data/
  settings.py
  cli.py
  api/
    clients.py
    routes.py
    raw_store.py
  ingest/
    districts.py
    events.py
    teams.py
    matches.py
  parsing/
    common.py
    season_2026.py
  db/
    connections.py
    repositories.py
  features/
    phase.py
    rolling_auto.py
experiments/
  opr.py
  lightgbm_auto.py
tests/
  fixtures/

The migration can be incremental: first extract stable functions from current scripts, then move only the functions that have proven reusable.

Closing Judgment

The project already expresses a good platform idea: use raw snapshots to feed a relational warehouse, then build ML-ready features and analytical tools on top. The work worth doing next is chiefly contract work. Make configuration truthful, schema evolution authoritative, API calls resilient, temporal features trustworthy, and pipeline stages explicit. Once those foundations are in place, adding seasons, stronger models, and richer analytics will become much cheaper and much safer.# FRC Data Science Project — Architecture & Code Review

A structured critique of the current codebase with diagrams, identified
issues, and prioritized recommendations. Prepared as a working document
for the maintainers; nothing in here changes runtime behavior.


1. Executive Summary

The project implements a classic data-pipeline architecture:
External APIs → JSON snapshots on disk → relational store (PostgreSQL) →
feature engineering → ML models + interactive EDA. The layering is
directionally correct and the use of pathlib, SQLAlchemy, and
dbmate-style migrations is solid.

However, the codebase shows symptoms of organic growth without
periodic consolidation
:

  • Several import-time failures (missing config constants).
  • Significant duplication in the API and ETL layers.
  • No error handling, retries, or logging anywhere in the pipeline.
  • Iterative pandas loops where vectorized operations would be 50–100× faster.
  • Train/test contamination in one of the ML scripts.
  • No automated tests.

The good news: the directory layout already suggests the correct
abstractions. Most fixes are mechanical refactors, not redesigns.


2. High-Level Architecture

flowchart TD
    subgraph EXT["External Services"]
        FRC["FRC API v3"]
        TBA["The Blue Alliance API v3"]
    end

    subgraph API["api/"]
        URL["frc_and_tba_api_url_builders.py"]
        CRED["get_credentials.py"]
        PATH["get_folder_file_both_from_api_url.py"]
        PULL["pull_and_save_api_url_data.py"]
        SCRIPTS["scripts/*_pull_save_create_reference.py"]
    end

    subgraph DISK["api/json/ - raw JSON cache"]
        D1["districts_data/"]
        D2["events_data/"]
        D3["matches_data/"]
        D4["teams_data/"]
        D5["reference_data/"]
    end

    subgraph SQL["sql/ - ETL + DB access"]
        LOAD["scripts/load_files/10..40_*.py"]
        TRANSFORM["match_data_from_json.py"]
        DBACCESS["data_science_db.py / load_db_to_df.py"]
        FEATSQL["features/load_files/*.sql"]
    end

    subgraph PG["PostgreSQL"]
        PUB["public.* - raw"]
        FEAT["features.* - engineered"]
        PRED["predictions.* - empty"]
    end

    subgraph DS["data_science/"]
        FE["feature_eng/auto_averages.py"]
        OPR["test_opr.py"]
        LGBM["lightgbm/first_try_light_gbm.py"]
        EDA["eda/test_streamlit_server.py"]
    end

    FRC --> URL
    TBA --> URL
    URL --> PULL
    CRED --> PULL
    PATH --> PULL
    PULL --> DISK
    SCRIPTS --> PULL

    DISK --> LOAD
    TRANSFORM --> LOAD
    LOAD --> PUB
    PUB --> FEATSQL --> FEAT

    FEAT --> FE --> FEAT
    FEAT --> OPR
    FEAT --> LGBM
    FEAT --> EDA
    DBACCESS -.-> SQL
    DBACCESS -.-> DS
Loading

Layer responsibilities

Layer Folder Responsibility
Config config/ Paths, DB URL, feature lists, year range
Ingestion api/ Build URLs, auth, fetch, persist JSON
Storage (raw) api/json/ + db/migrations/ JSON snapshots, schema versioning
ETL sql/scripts/load_files/ JSON → DataFrame → PostgreSQL
Feature eng. data_science/feature_eng/, sql/features/ Rolling stats, derived features
Modeling data_science/lightgbm/, data_science/test_opr.py LightGBM regressor, OPR
Exploration data_science/eda/ Streamlit dashboard

3. Module-by-Module Critique

3.1 config/

Centralizes paths and DB connection. Good use of pathlib and dotenv.

Issues

  • START_YEAR, END_YEAR, and SCHEMA are imported by api/scripts/ but never defined anywhere — these scripts cannot run as-is.
  • DB host/port are hard-coded to 127.0.0.1:5432 despite .env being loaded for everything else.
  • IMPORTANT_FEATURES is a long hand-edited Python list with no link to the DB schema; columns can silently drift.

3.2 api/

A reasonable URL-builder / fetcher split.

Issues

  • pull_and_save_api_url_data.py contains 3–4 near-identical pull_and_save_* functions — only headers/base URL differ.
  • get_folder_file_both_from_api_url.py reverse-engineers local file paths from URLs via regex + a walk to find .git. Brittle; a dict lookup of {endpoint_kind: folder} would replace it.
  • No try/except around requests.get, no retry, no rate limit, no status-code check beyond the obvious. A single transient 5xx aborts the whole multi-year pull.
  • first_api_bad_before_2020/ is named as if it were dead code but is still in the import graph.

3.3 sql/

DB access layer is the strongest part of the codebase: a shared SQLAlchemy ENGINE, context-managed connections, and a sensible read helper.

Issues

  • match_data_from_json.py has a function literally named matches_data_2026_from_json — the season is baked into the function name, meaning next year requires a copy-paste. The same function also assigns total_auto_points twice with different values.
  • scripts/load_files/40_create_matches_data_df_json_sql.py globs only *2026* JSON files.
  • Duplicate detection works by SELECT key FROM table into a Python set — this does not scale and ignores the obvious INSERT ... ON CONFLICT DO NOTHING.
  • write_df_to_sql.py calls an interactive confirm() from inside the write function, blocking any batch / scheduled use.
  • get_district_from_state_country.py hard-codes a partial US-state → district map in Python; this is data and belongs in a table.

3.4 db/

dbmate-style timestamped migrations — good.

Issues

  • 10 migrations in 3 days, several of which rename or remove columns added a day earlier — schema was clearly evolved at runtime rather than designed. Worth a consolidation pass before the next season.
  • Filename 20260512174020_add feature and predictions schema.sql contains a space — annoying for shell tools.

3.5 data_science/

feature_eng/auto_averages.py

  • Uses an explicit Python loop over matches to compute rolling means / max / min / std / trend. df.groupby('team_key').rolling(window=N) replaces ~80% of this code and runs orders of magnitude faster.
  • Import line from function_for_feature_eng import ... only works when CWD happens to be the right directory — should be from data_science.feature_eng.function_for_feature_eng import ....
  • Magic lookback windows [3, 5, 10, 'season'] should live in config/data_science.py.

test_opr.py

  • Sets up the OPR linear system correctly, but the Lasso section fits on X_train and then evaluates against y_test paired with predictions from X_traintrain/test leakage / mis-aligned shapes. Numbers reported from this script are not trustworthy.
  • No cross-validation; one fixed date split.

lightgbm/first_try_light_gbm.py

  • Hyperparameters hand-picked; the test set is also passed as eval_set during training (early-stopping on the test set ⇒ optimistic test metrics).
  • Categorical handling is implicit (LightGBM infers it). Make it explicit via categorical_feature=[...].

eda/test_streamlit_server.py

  • @st.cache_data with no TTL — dashboard shows stale data after DB writes until the Python process restarts.
  • Unused dython import; large block of commented-out code.

3.6 TestCode/ and root scripts

  • test.py / test_plot.ipynb are ad-hoc scratch — not pytest tests. There is no automated test suite at all.
  • confirm.py lives at repo root but is really a small utility; belongs in a utils/ package.
  • requirements.txt does not pin versions — reproducibility risk.
  • The file ystemctl list-units --type=service --state=running:q at the repo root looks like an accidental commit from a botched terminal command. Delete it.

4. Cross-Cutting Concerns

flowchart LR
    subgraph Concerns
        C1[Configuration]
        C2[Error handling]
        C3[Logging]
        C4[Testing]
        C5[Transactions]
        C6[Reproducibility]
    end
    C1 -->|missing consts| BUG1[Import failures]
    C2 -->|absent| BUG2[Whole pipeline aborts on 1 bad response]
    C3 -->|print only| BUG3[Hard to diagnose batch runs]
    C4 -->|none| BUG4[Refactors are scary]
    C5 -->|if_exists=append| BUG5[Partial loads on crash]
    C6 -->|unpinned deps| BUG6[Works on my machine]
Loading
Concern State Recommendation
Configuration Partial — paths good, constants missing Define START_YEAR, END_YEAR, SCHEMA; move all magic numbers in
Secrets OK — .env + base64 Add .env.example, document required vars
Error handling Absent Standard wrapper around requests with retry + status-code raise
Logging print only Adopt stdlib logging, one logger per module
Testing None pytest + fixtures for JSON samples; start with match_data_from_json
Transactions Append-only, no rollback Wrap each load in a transaction; use ON CONFLICT DO NOTHING
Naming Mostly snake_case, some inconsistencies (team_key1/2/3, year baked into names) Rename per-season helpers to take year; rename position-keyed columns to colour/role

5. Code Smells — Ranked

# Severity Location Smell
1 🔴 Critical api/scripts/*.py Imports of undefined START_YEAR, END_YEAR, SCHEMA
2 🔴 Critical data_science/test_opr.py Train/test shape mismatch in Lasso block
3 🔴 Critical data_science/feature_eng/auto_averages.py Broken relative import
4 🟠 High api/pull_and_save_api_url_data.py 3× near-duplicate fetch functions
5 🟠 High api/get_folder_file_both_from_api_url.py Regex + .git-walk to derive paths
6 🟠 High API layer No retries / status-check / rate limiting
7 🟠 High sql/get_district_from_state_country.py Hard-coded partial state map
8 🟡 Medium sql/match_data_from_json.py Season hard-coded into function name; duplicate field assignment
9 🟡 Medium sql/scripts/load_files/ Python-set duplicate detection instead of SQL upserts
10 🟡 Medium data_science/feature_eng/auto_averages.py Row-by-row pandas loops
11 🟡 Medium data_science/lightgbm/first_try_light_gbm.py Early-stopping on test set
12 🟡 Medium data_science/eda/test_streamlit_server.py Cache without TTL; dead imports
13 🟡 Medium sql/write_df_to_sql.py Interactive prompt inside writer
14 🟢 Low requirements.txt Unpinned versions
15 🟢 Low repo root Accidental file ystemctl ... :q
16 🟢 Low api/first_api_bad_before_2020/ Dead/archive code in live tree

6. Proposed Target Architecture

A relatively small rearrangement separates concerns the current layout already implies:

flowchart TD
    subgraph cfg["config/"]
        C["settings.py - pydantic-settings"]
    end
    subgraph ing["ingestion/ (was api/)"]
        CLI1["clients/frc.py"]
        CLI2["clients/tba.py"]
        BASE["clients/base.py - retry, auth, paging"]
        STORE["storage/json_cache.py"]
    end
    subgraph etl["etl/ (was sql/scripts/)"]
        PARSE["parsers/matches.py"]
        LOAD["loaders/postgres.py - upsert"]
    end
    subgraph db["db/"]
        MIG["migrations/"]
        REPO["repository/ - typed queries"]
    end
    subgraph feat["features/"]
        ROLL["rolling_stats.py - vectorized"]
        SQLFE["sql/*.sql"]
    end
    subgraph models["models/"]
        OPR2["opr.py"]
        GBM["lightgbm/"]
        TRAIN["train.py - CV, configs"]
    end
    subgraph apps["apps/"]
        DASH["dashboard/ - streamlit"]
    end
    subgraph tests["tests/"]
        UT["unit/"]
        IT["integration/"]
        FIX["fixtures/sample_*.json"]
    end

    C --> ing
    C --> etl
    C --> feat
    C --> models
    BASE --> CLI1
    BASE --> CLI2
    CLI1 --> STORE
    CLI2 --> STORE
    STORE --> PARSE
    PARSE --> LOAD
    LOAD --> REPO
    REPO --> feat
    feat --> models
    models --> apps
    tests -.-> ing
    tests -.-> etl
    tests -.-> feat
    tests -.-> models
Loading

Key differences from today:

  1. One BaseAPIClient with retry/auth/paging; FRC and TBA subclass it. Replaces all pull_and_save_* duplicates.
  2. JSON cache is a storage adapter, not coupled to URL regex parsing — the client tells the cache where to write.
  3. ETL parsers are pure functions (JSON in, records out) → trivially unit-testable.
  4. A small repository/ module wraps SQL queries used by feature/model code, so call sites don't write SQL strings.
  5. models/train.py is configuration-driven (YAML or pydantic) so OPR and LightGBM share a single train/eval loop with proper CV.
  6. tests/ mirrors source layout.

7. Prioritized Roadmap

Tier 1 — Make it run correctly (hours)

  1. Add START_YEAR, END_YEAR, SCHEMA to config/general.py.
  2. Fix the import path in data_science/feature_eng/auto_averages.py.
  3. Fix the OPR/Lasso train/test alignment in data_science/test_opr.py.
  4. Delete the stray ystemctl ... file at repo root.
  5. Pin requirements.txt (e.g. pip freeze > requirements.lock.txt).

Tier 2 — Remove duplication & brittleness (days)

  1. Collapse the pull_and_save_* family into a single client with auth strategies.
  2. Replace regex path inference with explicit endpoint → folder mapping.
  3. Parameterize matches_data_2026_from_json by year; pick parser from a {year: parser} registry.
  4. Switch ETL inserts to INSERT ... ON CONFLICT DO NOTHING and drop the in-Python set check.
  5. Move the state → district mapping into a seed table or a JSON in config/.

Tier 3 — Robustness & observability (days)

  1. Add requests-retry (e.g. urllib3.Retry adapter) with backoff and status-code checks.
  2. Introduce logging (one module-level logger) and replace print.
  3. Stand up pytest with fixtures for one JSON sample per entity; cover parsers first.
  4. Wrap each loader in an explicit DB transaction; remove the interactive prompt from write_df_to_postgres (move it to the calling CLI).
  5. Add ttl= to Streamlit caches.

Tier 4 — Modeling discipline (days)

  1. Vectorize rolling stats with groupby().rolling().
  2. Add a held-out validation split distinct from the test set; stop using test as eval_set.
  3. Make hyperparameters and feature lists config-driven; record run metadata.
  4. Add a baseline model (mean / per-team mean) to anchor metric comparisons.

Tier 5 — Polish (days)

  1. Move confirm.py into a utils/ package; add type hints; run ruff + mypy in CI.
  2. Write a real README.md (link to this doc; include make/uv commands).
  3. Archive api/first_api_bad_before_2020/ outside the import tree.

8. What's Already Good

  • Clear intent of layering (ingestion / storage / features / models / apps).
  • Migrations under version control with dbmate-style naming.
  • Centralized DB engine with context managers.
  • Separation of raw (public) vs engineered (features) vs serving (predictions) schemas — exactly the right shape.
  • Two complementary modeling approaches (OPR linear, GBM nonlinear) — good for sanity-checking.
  • Streamlit-based EDA next to the data, not in notebooks scattered on a laptop.

The bones are right. The work is mostly tightening the joints rather than rebuilding.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions