You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This review examines the repository as an evolving analytics platform rather than as a set of isolated scripts. The focus is structure, architecture, data contracts, operational reliability, and the path from exploratory work to a repeatable pipeline.
The review was based on direct inspection of the current source tree, including:
This was a source and architecture review. It did not execute API pulls, migrations, or model training. Editor diagnostics were clean for the inspected Python folders, but several findings below are runtime-contract or data-correctness issues that static diagnostics do not catch.
Executive Assessment
The repository has a promising core shape:
External FRC data is persisted as raw JSON before transformation.
JSON data is normalized into relational PostgreSQL tables.
Feature tables are derived into a dedicated features schema.
Modeling and EDA consume the feature layer rather than scraping raw API responses directly.
That shape is sound. It separates network acquisition, durable raw data, relational storage, feature engineering, and analysis. It is already more disciplined than a notebook-only data science project.
The main architectural problem is that the project is partway between "research scripts" and "operable data platform." The layers exist, but their contracts are not yet dependable:
Area
Current Assessment
Architectural Consequence
Configuration
Centralized in intent, inconsistent in implementation
Canonical scripts can fail before doing useful work
Schema management
Migrations exist, but generated/legacy SQL and refresh SQL disagree
Feature refreshes can fail against the current schema
Pipeline orchestration
Ordered scripts work as a manual convention
Recovery, repeatability, and partial reruns are fragile
Season handling
2026 logic is embedded through ingestion, schema, and models
Historical and future seasons require code edits across layers
Feature correctness
Useful temporal ideas, but at least one rolling window likely leaks current-match data
Predictive measurements can be optimistic or misleading
Tests and reproducibility
README exists, but it has drifted from source and dependency capture is incomplete
New contributors and future reruns will be costly
The highest-value next move is not a large rewrite. It is to stabilize the contracts that already exist: settings, schema authority, API client behavior, feature correctness, and a minimal pipeline command surface.
flowchart LR
FRC[FIRST API] --> URLS[URL builders]
TBA[The Blue Alliance API] --> URLS
URLS --> PULL[Pull scripts and save helpers]
PULL --> RAW[Raw JSON landing zone\napi/json]
RAW --> REFS[Reference JSON\nkeys and page-derived indexes]
RAW --> PYLOAD[Numbered Python loaders\nsql/scripts/load_files]
REFS --> PULL
PYLOAD --> CORE[(PostgreSQL public schema)]
CORE --> SQLFEATURES[Ordered SQL feature jobs\nsql/features/load_files]
SQLFEATURES --> FEATURES[(PostgreSQL features schema)]
FEATURES --> ROLLING[Python rolling feature job]
ROLLING --> FEATURES
FEATURES --> EDA[Streamlit EDA]
FEATURES --> MODELS[OPR, Lasso, LightGBM experiments]
Loading
The raw JSON handoff is a good decision. It allows ingestion and normalization to be debugged separately. It also provides a replay boundary so transformation bugs do not force repeated network calls.
The weaker part is orchestration. Many scripts execute work at import time rather than exposing reusable functions plus an explicit main() or command. The order is conveyed by filenames and README instructions, not by a testable pipeline abstraction.
Simplified Relational Shape
erDiagram
DISTRICTS ||--o{ EVENTS : groups
EVENTS ||--o{ MATCHES : hosts
MATCHES ||--|{ MATCH_DATA_2026 : scores
MATCHES ||--o{ MATCH_TEAMS : includes
TEAMS ||--o{ MATCH_TEAMS : participates
FILTERED_EVENTS ||--o{ MATCHES_EDA : filters_context
MATCHES_EDA ||--o{ MATCH_AUTO_DATA_CALC : supports_features
DISTRICTS {
varchar district_key PK
varchar district_code
varchar district_name
}
EVENTS {
varchar event_key PK
int year
varchar district_key FK
}
MATCHES {
varchar match_key PK
varchar event_key FK
timestamp actual_time
}
MATCH_DATA_2026 {
varchar match_key PK
varchar alliance PK
int auto_points
int total_score
}
MATCH_TEAMS {
varchar match_key PK
varchar team_key PK
varchar alliance
}
TEAMS {
varchar team_key PK
int team_number
}
FILTERED_EVENTS {
varchar event_key PK
int year
int week
}
MATCHES_EDA {
varchar match_key PK
varchar alliance PK
varchar team_key1
varchar team_key2
varchar team_key3
int auto_points
}
MATCH_AUTO_DATA_CALC {
varchar team_key PK
varchar match_key PK
int prev_match_count
}
Loading
The public schema is sensibly normalized for core entities. The features schema is intentionally denormalized for EDA and modeling. That is a reasonable warehouse pattern as long as the refresh path, schema history, and feature semantics remain synchronized.
What the Codebase Already Does Well
1. It has a clear data lifecycle
The broad path from APIs to JSON to SQL to features to analysis is easy to explain. That matters. A new contributor can infer the purpose of most top-level directories without reading every script.
2. Raw JSON acts as an ingestion boundary
Files under api/json/ are ignored in .gitignore, which avoids committing large generated data. More importantly, the design acknowledges that external API responses are inputs worth retaining locally during development.
3. The relational design is more intentional than the script layout
db/schema.sql defines keys, foreign keys, and indexes. The split among matches, match_data_2026, and match_teams is useful: metadata, alliance-level scoring, and team participation are not flattened too early.
4. Feature tables are separated from source tables
Derived tables live in features, not public. This is a good governance instinct. It preserves the option to rebuild features from source facts and makes it easier to distinguish raw-ish facts from model-ready views.
5. Match payload parsing is centralized
sql/match_data_from_json.py concentrates the 2026 match parsing logic. That is much easier to maintain than duplicating score extraction inside every loader or model script.
Findings and Critique
Finding 1: The runtime configuration contract is internally inconsistent
config/directories.py defines REPO_ROOT = Path(__file__).resolve().parent, which resolves to the config directory. It then builds paths such as REPO_ROOT / 'api' / 'json', which points to config/api/json rather than the repository's api/json directory.
The environment-variable names documented in README.md do not match the implementation in config/database.py and api/get_credentials.py. For example, code expects POSTGRESQL_USERNAME, POSTGRESQL_PASSWORD, POSTGRESQL_DB_FRCDATASCIENCE, DATABASE_URL, and THE_BLUE_ALLIANCE_API_KEY, while the README describes a different naming scheme.
Why it matters:
Configuration is a cross-cutting architectural contract. When it is inconsistent, every other subsystem appears unreliable even if its business logic is correct. The issue is not merely naming style; it affects path resolution, script importability, and setup instructions.
Recommendation:
Create one import-safe settings module with explicit fields for:
Repository root and all generated-data directories
Schema name
Season range or selected season set
Database DSN and explicit DB fields, with one documented source of truth
FIRST and TBA credentials
Validate settings once at command startup and fail with precise messages. Update the README from that contract, not independently of it.
Finding 2: Schema migrations and feature refresh SQL have drifted apart
sql/SQL_Schema.sql is an older generated schema artifact with a materially different shape from db/schema.sql. It omits newer schema evolution and still represents older relationships.
Why it matters:
The feature refresh path is supposed to be a stable bridge from normalized facts to analysis tables. If it does not agree with the migration-backed schema, the pipeline can fail after doing ingestion and loading work successfully. Multiple schema authorities also increase the chance that future changes are made against the wrong definition.
Recommendation:
Treat migrations plus the generated db/schema.sql dump as the only schema authority. Either remove sql/SQL_Schema.sql, clearly mark it as archival, or regenerate it from the same canonical source. Add an integration check that applies migrations to an empty database and runs all feature SQL files in order.
Finding 3: Pipeline orchestration is convention-driven rather than command-driven
Severity: Medium-High
Evidence:
sql/scripts/load_all_table_data.py relies on lexicographically sorted filenames and subprocess calls to execute all Python and SQL steps.
API scripts perform work at module import time rather than exposing side-effect-free functions plus explicit entrypoints.
Reference JSON files operate like manifests, but there is no explicit stage-status model or run metadata.
Why it matters:
Filename ordering is acceptable as a temporary migration pattern, but it becomes fragile once the project has partial reruns, failures halfway through a load, multiple seasons, or scheduled runs. The current design makes it hard to answer operational questions such as "Which stage completed?", "Can I rerun just features?", and "Is this table from the same raw-data snapshot as the model input?"
Recommendation:
Introduce a small CLI or task layer with stage-oriented commands, for example:
Each command should validate settings, log inputs and outputs, and define its rerun behavior. Existing scripts can be refactored into callable functions first, then wired behind that CLI without a disruptive reorganization.
Finding 4: The API layer lacks resilience and a single client abstraction
Severity: Medium-High
Evidence:
api/pull_and_save_api_url_data.py calls requests.get() directly in four code paths with no shared session, timeout, retry, or rate-limit handling.
Directory creation is not centralized before file writes.
Error handling mostly prints HTTP status and continues.
The debug branch in pull_and_save_teams_frc_api_data() can reference response-derived variables after a file-skip path where those variables were not assigned.
api/get_folder_file_both_from_api_url.py performs URL-to-folder routing procedurally. The TBA loop uses exit where a loop break appears to be intended, and call_folder is not initialized before conditional use.
Why it matters:
API behavior is outside the repository's control. Transient 429s, 5xx responses, invalid credentials, and partial network failures are routine. A resilient client layer prevents many operational failures from leaking into every orchestration script.
Recommendation:
Create a thin ApiClient boundary for FIRST and TBA with:
Session reuse
Request timeout policy
Bounded retries with backoff for retryable statuses
Explicit auth validation errors
Structured result objects or exceptions instead of print-only failure paths
A raw-store helper that creates directories and performs atomic-ish write patterns where appropriate
Unit-test URL routing and file-name generation separately from network calls.
Finding 5: Season-specific data shape is embedded across layers
Severity: Medium
Evidence:
The schema uses public.match_data_2026 in db/schema.sql.
FRC scoring formats change by season, so some season specialization is appropriate. The architectural issue is not that 2026 logic exists; it is that season selection, parser choice, database naming, and experiment periods are scattered across unrelated layers. That makes the project harder to extend to 2027 or backfill older years without editing a wide swath of files.
Recommendation:
Separate season-invariant concepts from season-specific score adapters:
Keep matches, teams, events, and match_teams stable.
Represent game-specific score facts through versioned adapters and either season-specific tables behind a stable view or a normalized score-fact model with season metadata.
Move training cutoffs and active seasons into explicit experiment or pipeline configuration.
Finding 6: Rolling feature generation likely leaks current-match data
data_science/feature_eng/function_for_feature_eng.py calculates many window statistics from slices such as scores[-i:]. After the current score has been appended, those slices include the target match's value for common window sizes.
The trend path does not follow the same "last N prior matches" window convention as the other statistics, making feature semantics harder to reason about.
Why it matters:
Temporal features intended for prediction must only use information available before the match being predicted. Leakage can make model evaluation look stronger than reality and conceal genuine generalization problems.
Recommendation:
Define temporal feature semantics precisely: each feature row for match M may use only matches with actual_time < M.actual_time, with deterministic tie handling. Compute features from the prior history first, then append the current result to history. Add small fixture tests where the correct mean, max, count, and trend can be verified by hand.
Finding 7: The analytics layer is exploratory, but its boundaries are not labeled as such
Severity: Medium
Evidence:
data_science/lightgbm/first_try_light_gbm.py trains a model directly at script load, mixes model fitting with Streamlit plotting, and imports metrics/helpers that are not used.
data_science/test_opr.py contains both a manual OPR workflow and a Lasso experiment. The Lasso section calls model.fit(X_train, y_test) rather than pairing training features with training labels, and its feature matrix still contains team-key strings rather than a validated numeric encoding.
data_science/eda/test_streamlit_server.py is useful for investigation, but the README describes some behavior more strongly than the implementation currently provides.
Why it matters:
Exploratory scripts are normal in a data science repo. The architectural risk appears when experiments look like sanctioned pipeline stages but lack the protections expected of those stages. Readers may assume model metrics are reproducible, feature inputs are stable, or dashboard behavior matches documentation when those are still evolving.
Recommendation:
Make the distinction explicit:
Keep prototypes in an experiments or clearly labeled research area.
Promote stable workflows into parameterized modules with tests.
Store experiment configuration, metrics, and split definitions together.
Use the existing predictions schema only after prediction outputs have a documented contract.
Finding 8: Reproducibility and documentation have drifted from code
Severity: Medium
Evidence:
requirements.txt includes ingestion, SQL, plotting, and Streamlit dependencies, but active Python files import packages not listed there: lightgbm, scikit-learn, scipy, seaborn, and dython.
The README documents a legitimate end-to-end intent, but its configuration variable names do not match source code, its feature/model claims are ahead of implementation in places, and it does not mention the current schema/feature SQL mismatch.
No conventional automated test setup, CI workflow, or project package metadata was found alongside the implementation. The files named test_*.py are currently exploratory scripts rather than assertion-driven tests.
Why it matters:
A data platform is reproducible only when code, dependencies, schema, and operating instructions agree. Drift here increases onboarding cost and makes future results harder to trust.
Recommendation:
At minimum:
Split dependency capture into runtime, data-science, and development/test groups, or adopt a pyproject.toml-based toolchain.
Update README configuration and run commands from a single verified source of truth.
Add CI checks for importability, unit tests, migration application, and feature SQL execution against a small fixture database.
Suggested Target Architecture
The current code does not need to abandon its directory structure immediately. A practical target is to introduce clearer boundaries while preserving the existing data flow.
flowchart LR
SETTINGS[Validated settings]
CLI[CLI or task entrypoints]
API[Typed API clients\nFIRST and TBA]
RAW[Raw store\nJSON plus optional run manifest]
PARSERS[Season-aware parsers]
CORE[(Public relational tables)]
MIGRATIONS[Migrations and schema dump\none authority]
FEATURES[Feature builders\nSQL and Python]
FEATUREDB[(Features schema)]
EXPERIMENTS[Experiments and notebooks]
TRAIN[Stable training jobs]
PRED[(Predictions schema)]
TESTS[Fixture and integration tests]
SETTINGS --> CLI
CLI --> API
API --> RAW
RAW --> PARSERS
PARSERS --> CORE
MIGRATIONS --> CORE
CORE --> FEATURES
FEATURES --> FEATUREDB
FEATUREDB --> EXPERIMENTS
FEATUREDB --> TRAIN
TRAIN --> PRED
TESTS --> API
TESTS --> PARSERS
TESTS --> FEATURES
Loading
The target architecture introduces three ideas that the current repository is ready for:
Settings are validated once and then passed through the system, rather than rediscovered inconsistently.
Season specialization is an adapter concern, not a scattered global assumption.
Pipeline stages are commands with contracts, not only scripts that happen to run in a useful order.
Recommended Improvement Roadmap
Phase 1: Restore Contract Integrity
Fix and consolidate settings: repository root, SCHEMA, season range, credentials, and database DSN.
Synchronize README environment variable names with actual settings.
Repair features.matches_eda refresh SQL so it matches current migrations.
The migration can be incremental: first extract stable functions from current scripts, then move only the functions that have proven reusable.
Closing Judgment
The project already expresses a good platform idea: use raw snapshots to feed a relational warehouse, then build ML-ready features and analytical tools on top. The work worth doing next is chiefly contract work. Make configuration truthful, schema evolution authoritative, API calls resilient, temporal features trustworthy, and pipeline stages explicit. Once those foundations are in place, adding seasons, stronger models, and richer analytics will become much cheaper and much safer.# FRC Data Science Project — Architecture & Code Review
A structured critique of the current codebase with diagrams, identified
issues, and prioritized recommendations. Prepared as a working document
for the maintainers; nothing in here changes runtime behavior.
1. Executive Summary
The project implements a classic data-pipeline architecture:
External APIs → JSON snapshots on disk → relational store (PostgreSQL) →
feature engineering → ML models + interactive EDA. The layering is directionally correct and the use of pathlib, SQLAlchemy, and dbmate-style migrations is solid.
However, the codebase shows symptoms of organic growth without
periodic consolidation:
Several import-time failures (missing config constants).
Significant duplication in the API and ETL layers.
No error handling, retries, or logging anywhere in the pipeline.
Iterative pandas loops where vectorized operations would be 50–100× faster.
Train/test contamination in one of the ML scripts.
No automated tests.
The good news: the directory layout already suggests the correct
abstractions. Most fixes are mechanical refactors, not redesigns.
2. High-Level Architecture
flowchart TD
subgraph EXT["External Services"]
FRC["FRC API v3"]
TBA["The Blue Alliance API v3"]
end
subgraph API["api/"]
URL["frc_and_tba_api_url_builders.py"]
CRED["get_credentials.py"]
PATH["get_folder_file_both_from_api_url.py"]
PULL["pull_and_save_api_url_data.py"]
SCRIPTS["scripts/*_pull_save_create_reference.py"]
end
subgraph DISK["api/json/ - raw JSON cache"]
D1["districts_data/"]
D2["events_data/"]
D3["matches_data/"]
D4["teams_data/"]
D5["reference_data/"]
end
subgraph SQL["sql/ - ETL + DB access"]
LOAD["scripts/load_files/10..40_*.py"]
TRANSFORM["match_data_from_json.py"]
DBACCESS["data_science_db.py / load_db_to_df.py"]
FEATSQL["features/load_files/*.sql"]
end
subgraph PG["PostgreSQL"]
PUB["public.* - raw"]
FEAT["features.* - engineered"]
PRED["predictions.* - empty"]
end
subgraph DS["data_science/"]
FE["feature_eng/auto_averages.py"]
OPR["test_opr.py"]
LGBM["lightgbm/first_try_light_gbm.py"]
EDA["eda/test_streamlit_server.py"]
end
FRC --> URL
TBA --> URL
URL --> PULL
CRED --> PULL
PATH --> PULL
PULL --> DISK
SCRIPTS --> PULL
DISK --> LOAD
TRANSFORM --> LOAD
LOAD --> PUB
PUB --> FEATSQL --> FEAT
FEAT --> FE --> FEAT
FEAT --> OPR
FEAT --> LGBM
FEAT --> EDA
DBACCESS -.-> SQL
DBACCESS -.-> DS
get_folder_file_both_from_api_url.py reverse-engineers local file paths from URLs via regex + a walk to find .git. Brittle; a dict lookup of {endpoint_kind: folder} would replace it.
No try/except around requests.get, no retry, no rate limit, no status-code check beyond the obvious. A single transient 5xx aborts the whole multi-year pull.
first_api_bad_before_2020/ is named as if it were dead code but is still in the import graph.
3.3 sql/
DB access layer is the strongest part of the codebase: a shared SQLAlchemy ENGINE, context-managed connections, and a sensible read helper.
Issues
match_data_from_json.py has a function literally named matches_data_2026_from_json — the season is baked into the function name, meaning next year requires a copy-paste. The same function also assigns total_auto_points twice with different values.
10 migrations in 3 days, several of which rename or remove columns added a day earlier — schema was clearly evolved at runtime rather than designed. Worth a consolidation pass before the next season.
Filename 20260512174020_add feature and predictions schema.sql contains a space — annoying for shell tools.
3.5 data_science/
feature_eng/auto_averages.py
Uses an explicit Python loop over matches to compute rolling means / max / min / std / trend. df.groupby('team_key').rolling(window=N) replaces ~80% of this code and runs orders of magnitude faster.
Import line from function_for_feature_eng import ... only works when CWD happens to be the right directory — should be from data_science.feature_eng.function_for_feature_eng import ....
Magic lookback windows [3, 5, 10, 'season'] should live in config/data_science.py.
test_opr.py
Sets up the OPR linear system correctly, but the Lasso section fits on X_train and then evaluates against y_test paired with predictions from X_train — train/test leakage / mis-aligned shapes. Numbers reported from this script are not trustworthy.
No cross-validation; one fixed date split.
lightgbm/first_try_light_gbm.py
Hyperparameters hand-picked; the test set is also passed as eval_set during training (early-stopping on the test set ⇒ optimistic test metrics).
Categorical handling is implicit (LightGBM infers it). Make it explicit via categorical_feature=[...].
eda/test_streamlit_server.py
@st.cache_data with no TTL — dashboard shows stale data after DB writes until the Python process restarts.
Unused dython import; large block of commented-out code.
3.6 TestCode/ and root scripts
test.py / test_plot.ipynb are ad-hoc scratch — not pytest tests. There is no automated test suite at all.
confirm.py lives at repo root but is really a small utility; belongs in a utils/ package.
requirements.txt does not pin versions — reproducibility risk.
The file ystemctl list-units --type=service --state=running:q at the repo root looks like an accidental commit from a botched terminal command. Delete it.
4. Cross-Cutting Concerns
flowchart LR
subgraph Concerns
C1[Configuration]
C2[Error handling]
C3[Logging]
C4[Testing]
C5[Transactions]
C6[Reproducibility]
end
C1 -->|missing consts| BUG1[Import failures]
C2 -->|absent| BUG2[Whole pipeline aborts on 1 bad response]
C3 -->|print only| BUG3[Hard to diagnose batch runs]
C4 -->|none| BUG4[Refactors are scary]
C5 -->|if_exists=append| BUG5[Partial loads on crash]
C6 -->|unpinned deps| BUG6[Works on my machine]
Loading
Concern
State
Recommendation
Configuration
Partial — paths good, constants missing
Define START_YEAR, END_YEAR, SCHEMA; move all magic numbers in
Secrets
OK — .env + base64
Add .env.example, document required vars
Error handling
Absent
Standard wrapper around requests with retry + status-code raise
Logging
print only
Adopt stdlib logging, one logger per module
Testing
None
pytest + fixtures for JSON samples; start with match_data_from_json
Transactions
Append-only, no rollback
Wrap each load in a transaction; use ON CONFLICT DO NOTHING
Naming
Mostly snake_case, some inconsistencies (team_key1/2/3, year baked into names)
Rename per-season helpers to take year; rename position-keyed columns to colour/role
Architecture Review: FRC Data Science Project
Review Scope
This review examines the repository as an evolving analytics platform rather than as a set of isolated scripts. The focus is structure, architecture, data contracts, operational reliability, and the path from exploratory work to a repeatable pipeline.
The review was based on direct inspection of the current source tree, including:
This was a source and architecture review. It did not execute API pulls, migrations, or model training. Editor diagnostics were clean for the inspected Python folders, but several findings below are runtime-contract or data-correctness issues that static diagnostics do not catch.
Executive Assessment
The repository has a promising core shape:
featuresschema.That shape is sound. It separates network acquisition, durable raw data, relational storage, feature engineering, and analysis. It is already more disciplined than a notebook-only data science project.
The main architectural problem is that the project is partway between "research scripts" and "operable data platform." The layers exist, but their contracts are not yet dependable:
The highest-value next move is not a large rewrite. It is to stabilize the contracts that already exist: settings, schema authority, API client behavior, feature correctness, and a minimal pipeline command surface.
Current System Shape
Repository Responsibilities
api/json/Current Data Flow
flowchart LR FRC[FIRST API] --> URLS[URL builders] TBA[The Blue Alliance API] --> URLS URLS --> PULL[Pull scripts and save helpers] PULL --> RAW[Raw JSON landing zone\napi/json] RAW --> REFS[Reference JSON\nkeys and page-derived indexes] RAW --> PYLOAD[Numbered Python loaders\nsql/scripts/load_files] REFS --> PULL PYLOAD --> CORE[(PostgreSQL public schema)] CORE --> SQLFEATURES[Ordered SQL feature jobs\nsql/features/load_files] SQLFEATURES --> FEATURES[(PostgreSQL features schema)] FEATURES --> ROLLING[Python rolling feature job] ROLLING --> FEATURES FEATURES --> EDA[Streamlit EDA] FEATURES --> MODELS[OPR, Lasso, LightGBM experiments]The raw JSON handoff is a good decision. It allows ingestion and normalization to be debugged separately. It also provides a replay boundary so transformation bugs do not force repeated network calls.
The weaker part is orchestration. Many scripts execute work at import time rather than exposing reusable functions plus an explicit
main()or command. The order is conveyed by filenames and README instructions, not by a testable pipeline abstraction.Simplified Relational Shape
erDiagram DISTRICTS ||--o{ EVENTS : groups EVENTS ||--o{ MATCHES : hosts MATCHES ||--|{ MATCH_DATA_2026 : scores MATCHES ||--o{ MATCH_TEAMS : includes TEAMS ||--o{ MATCH_TEAMS : participates FILTERED_EVENTS ||--o{ MATCHES_EDA : filters_context MATCHES_EDA ||--o{ MATCH_AUTO_DATA_CALC : supports_features DISTRICTS { varchar district_key PK varchar district_code varchar district_name } EVENTS { varchar event_key PK int year varchar district_key FK } MATCHES { varchar match_key PK varchar event_key FK timestamp actual_time } MATCH_DATA_2026 { varchar match_key PK varchar alliance PK int auto_points int total_score } MATCH_TEAMS { varchar match_key PK varchar team_key PK varchar alliance } TEAMS { varchar team_key PK int team_number } FILTERED_EVENTS { varchar event_key PK int year int week } MATCHES_EDA { varchar match_key PK varchar alliance PK varchar team_key1 varchar team_key2 varchar team_key3 int auto_points } MATCH_AUTO_DATA_CALC { varchar team_key PK varchar match_key PK int prev_match_count }The public schema is sensibly normalized for core entities. The features schema is intentionally denormalized for EDA and modeling. That is a reasonable warehouse pattern as long as the refresh path, schema history, and feature semantics remain synchronized.
What the Codebase Already Does Well
1. It has a clear data lifecycle
The broad path from APIs to JSON to SQL to features to analysis is easy to explain. That matters. A new contributor can infer the purpose of most top-level directories without reading every script.
2. Raw JSON acts as an ingestion boundary
Files under
api/json/are ignored in.gitignore, which avoids committing large generated data. More importantly, the design acknowledges that external API responses are inputs worth retaining locally during development.3. The relational design is more intentional than the script layout
db/schema.sql defines keys, foreign keys, and indexes. The split among
matches,match_data_2026, andmatch_teamsis useful: metadata, alliance-level scoring, and team participation are not flattened too early.4. Feature tables are separated from source tables
Derived tables live in
features, notpublic. This is a good governance instinct. It preserves the option to rebuild features from source facts and makes it easier to distinguish raw-ish facts from model-ready views.5. Match payload parsing is centralized
sql/match_data_from_json.py concentrates the 2026 match parsing logic. That is much easier to maintain than duplicating score extraction inside every loader or model script.
Findings and Critique
Finding 1: The runtime configuration contract is internally inconsistent
Severity: High
Evidence:
START_YEARandEND_YEARfromconfig, but the inspected config modules only exposeCURR_YEARin config/general.py.SCHEMAfromconfig, including sql/scripts/load_files/40_create_matches_data_df_json_sql.py, but no config assignment forSCHEMAwas found.REPO_ROOT = Path(__file__).resolve().parent, which resolves to theconfigdirectory. It then builds paths such asREPO_ROOT / 'api' / 'json', which points toconfig/api/jsonrather than the repository'sapi/jsondirectory..gitis found.POSTGRESQL_USERNAME,POSTGRESQL_PASSWORD,POSTGRESQL_DB_FRCDATASCIENCE,DATABASE_URL, andTHE_BLUE_ALLIANCE_API_KEY, while the README describes a different naming scheme.Why it matters:
Configuration is a cross-cutting architectural contract. When it is inconsistent, every other subsystem appears unreliable even if its business logic is correct. The issue is not merely naming style; it affects path resolution, script importability, and setup instructions.
Recommendation:
Create one import-safe settings module with explicit fields for:
Validate settings once at command startup and fail with precise messages. Update the README from that contract, not independently of it.
Finding 2: Schema migrations and feature refresh SQL have drifted apart
Severity: High
Evidence:
district_keyandpredicted_timefromfeatures.matches_eda.district_keyandpredicted_time, which no longer exist in the table.Why it matters:
The feature refresh path is supposed to be a stable bridge from normalized facts to analysis tables. If it does not agree with the migration-backed schema, the pipeline can fail after doing ingestion and loading work successfully. Multiple schema authorities also increase the chance that future changes are made against the wrong definition.
Recommendation:
Treat migrations plus the generated db/schema.sql dump as the only schema authority. Either remove sql/SQL_Schema.sql, clearly mark it as archival, or regenerate it from the same canonical source. Add an integration check that applies migrations to an empty database and runs all feature SQL files in order.
Finding 3: Pipeline orchestration is convention-driven rather than command-driven
Severity: Medium-High
Evidence:
Why it matters:
Filename ordering is acceptable as a temporary migration pattern, but it becomes fragile once the project has partial reruns, failures halfway through a load, multiple seasons, or scheduled runs. The current design makes it hard to answer operational questions such as "Which stage completed?", "Can I rerun just features?", and "Is this table from the same raw-data snapshot as the model input?"
Recommendation:
Introduce a small CLI or task layer with stage-oriented commands, for example:
Each command should validate settings, log inputs and outputs, and define its rerun behavior. Existing scripts can be refactored into callable functions first, then wired behind that CLI without a disruptive reorganization.
Finding 4: The API layer lacks resilience and a single client abstraction
Severity: Medium-High
Evidence:
requests.get()directly in four code paths with no shared session, timeout, retry, or rate-limit handling.pull_and_save_teams_frc_api_data()can reference response-derived variables after a file-skip path where those variables were not assigned.exitwhere a loop break appears to be intended, andcall_folderis not initialized before conditional use.Why it matters:
API behavior is outside the repository's control. Transient 429s, 5xx responses, invalid credentials, and partial network failures are routine. A resilient client layer prevents many operational failures from leaking into every orchestration script.
Recommendation:
Create a thin
ApiClientboundary for FIRST and TBA with:Unit-test URL routing and file-name generation separately from network calls.
Finding 5: Season-specific data shape is embedded across layers
Severity: Medium
Evidence:
public.match_data_2026in db/schema.sql.matches_data_2026_from_json()parser with 2026 game-specific fields.*2026*match files.CURR_YEAR = 2026.2026-04-10in data_science/lightgbm/first_try_light_gbm.py and data_science/test_opr.py.Why it matters:
FRC scoring formats change by season, so some season specialization is appropriate. The architectural issue is not that 2026 logic exists; it is that season selection, parser choice, database naming, and experiment periods are scattered across unrelated layers. That makes the project harder to extend to 2027 or backfill older years without editing a wide swath of files.
Recommendation:
Separate season-invariant concepts from season-specific score adapters:
matches,teams,events, andmatch_teamsstable.Finding 6: Rolling feature generation likely leaks current-match data
Severity: High for modeling correctness
Evidence:
auto_pointstoteam_auto_dict[current_team]before calculating historical statistics.scores[-i:]. After the current score has been appended, those slices include the target match's value for common window sizes.Why it matters:
Temporal features intended for prediction must only use information available before the match being predicted. Leakage can make model evaluation look stronger than reality and conceal genuine generalization problems.
Recommendation:
Define temporal feature semantics precisely: each feature row for match
Mmay use only matches withactual_time < M.actual_time, with deterministic tie handling. Compute features from the prior history first, then append the current result to history. Add small fixture tests where the correct mean, max, count, and trend can be verified by hand.Finding 7: The analytics layer is exploratory, but its boundaries are not labeled as such
Severity: Medium
Evidence:
model.fit(X_train, y_test)rather than pairing training features with training labels, and its feature matrix still contains team-key strings rather than a validated numeric encoding.Why it matters:
Exploratory scripts are normal in a data science repo. The architectural risk appears when experiments look like sanctioned pipeline stages but lack the protections expected of those stages. Readers may assume model metrics are reproducible, feature inputs are stable, or dashboard behavior matches documentation when those are still evolving.
Recommendation:
Make the distinction explicit:
experimentsor clearly labeled research area.predictionsschema only after prediction outputs have a documented contract.Finding 8: Reproducibility and documentation have drifted from code
Severity: Medium
Evidence:
lightgbm,scikit-learn,scipy,seaborn, anddython.test_*.pyare currently exploratory scripts rather than assertion-driven tests.Why it matters:
A data platform is reproducible only when code, dependencies, schema, and operating instructions agree. Drift here increases onboarding cost and makes future results harder to trust.
Recommendation:
At minimum:
pyproject.toml-based toolchain.Suggested Target Architecture
The current code does not need to abandon its directory structure immediately. A practical target is to introduce clearer boundaries while preserving the existing data flow.
flowchart LR SETTINGS[Validated settings] CLI[CLI or task entrypoints] API[Typed API clients\nFIRST and TBA] RAW[Raw store\nJSON plus optional run manifest] PARSERS[Season-aware parsers] CORE[(Public relational tables)] MIGRATIONS[Migrations and schema dump\none authority] FEATURES[Feature builders\nSQL and Python] FEATUREDB[(Features schema)] EXPERIMENTS[Experiments and notebooks] TRAIN[Stable training jobs] PRED[(Predictions schema)] TESTS[Fixture and integration tests] SETTINGS --> CLI CLI --> API API --> RAW RAW --> PARSERS PARSERS --> CORE MIGRATIONS --> CORE CORE --> FEATURES FEATURES --> FEATUREDB FEATUREDB --> EXPERIMENTS FEATUREDB --> TRAIN TRAIN --> PRED TESTS --> API TESTS --> PARSERS TESTS --> FEATURESThe target architecture introduces three ideas that the current repository is ready for:
Recommended Improvement Roadmap
Phase 1: Restore Contract Integrity
SCHEMA, season range, credentials, and database DSN.features.matches_edarefresh SQL so it matches current migrations.Phase 2: Harden Ingestion and Loads
Phase 3: Make the Feature Layer Trustworthy
Phase 4: Mature the Analytics Surface
predictionsschema contract is clear.Recommended Test Matrix
REPO_ROOT,SCHEMA, season bounds, env-name mappingsql/features/load_filesscriptsSuggested Package Evolution
This is a direction, not a required immediate rewrite. The repo could gradually converge toward a structure like:
The migration can be incremental: first extract stable functions from current scripts, then move only the functions that have proven reusable.
Closing Judgment
The project already expresses a good platform idea: use raw snapshots to feed a relational warehouse, then build ML-ready features and analytical tools on top. The work worth doing next is chiefly contract work. Make configuration truthful, schema evolution authoritative, API calls resilient, temporal features trustworthy, and pipeline stages explicit. Once those foundations are in place, adding seasons, stronger models, and richer analytics will become much cheaper and much safer.# FRC Data Science Project — Architecture & Code Review
1. Executive Summary
The project implements a classic data-pipeline architecture:
External APIs → JSON snapshots on disk → relational store (PostgreSQL) →
feature engineering → ML models + interactive EDA. The layering is
directionally correct and the use of
pathlib,SQLAlchemy, anddbmate-style migrations is solid.However, the codebase shows symptoms of organic growth without
periodic consolidation:
The good news: the directory layout already suggests the correct
abstractions. Most fixes are mechanical refactors, not redesigns.
2. High-Level Architecture
flowchart TD subgraph EXT["External Services"] FRC["FRC API v3"] TBA["The Blue Alliance API v3"] end subgraph API["api/"] URL["frc_and_tba_api_url_builders.py"] CRED["get_credentials.py"] PATH["get_folder_file_both_from_api_url.py"] PULL["pull_and_save_api_url_data.py"] SCRIPTS["scripts/*_pull_save_create_reference.py"] end subgraph DISK["api/json/ - raw JSON cache"] D1["districts_data/"] D2["events_data/"] D3["matches_data/"] D4["teams_data/"] D5["reference_data/"] end subgraph SQL["sql/ - ETL + DB access"] LOAD["scripts/load_files/10..40_*.py"] TRANSFORM["match_data_from_json.py"] DBACCESS["data_science_db.py / load_db_to_df.py"] FEATSQL["features/load_files/*.sql"] end subgraph PG["PostgreSQL"] PUB["public.* - raw"] FEAT["features.* - engineered"] PRED["predictions.* - empty"] end subgraph DS["data_science/"] FE["feature_eng/auto_averages.py"] OPR["test_opr.py"] LGBM["lightgbm/first_try_light_gbm.py"] EDA["eda/test_streamlit_server.py"] end FRC --> URL TBA --> URL URL --> PULL CRED --> PULL PATH --> PULL PULL --> DISK SCRIPTS --> PULL DISK --> LOAD TRANSFORM --> LOAD LOAD --> PUB PUB --> FEATSQL --> FEAT FEAT --> FE --> FEAT FEAT --> OPR FEAT --> LGBM FEAT --> EDA DBACCESS -.-> SQL DBACCESS -.-> DSLayer responsibilities
api/json/+ db/migrations/3. Module-by-Module Critique
3.1
config/Centralizes paths and DB connection. Good use of
pathlibanddotenv.Issues
START_YEAR,END_YEAR, andSCHEMAare imported by api/scripts/ but never defined anywhere — these scripts cannot run as-is.127.0.0.1:5432despite.envbeing loaded for everything else.IMPORTANT_FEATURESis a long hand-edited Python list with no link to the DB schema; columns can silently drift.3.2
api/A reasonable URL-builder / fetcher split.
Issues
pull_and_save_*functions — only headers/base URL differ..git. Brittle; a dict lookup of{endpoint_kind: folder}would replace it.try/exceptaroundrequests.get, no retry, no rate limit, no status-code check beyond the obvious. A single transient 5xx aborts the whole multi-year pull.first_api_bad_before_2020/is named as if it were dead code but is still in the import graph.3.3
sql/DB access layer is the strongest part of the codebase: a shared SQLAlchemy
ENGINE, context-managed connections, and a sensible read helper.Issues
matches_data_2026_from_json— the season is baked into the function name, meaning next year requires a copy-paste. The same function also assignstotal_auto_pointstwice with different values.*2026*JSON files.SELECT key FROM tableinto a Pythonset— this does not scale and ignores the obviousINSERT ... ON CONFLICT DO NOTHING.confirm()from inside the write function, blocking any batch / scheduled use.3.4
db/dbmate-style timestamped migrations — good.Issues
20260512174020_add feature and predictions schema.sqlcontains a space — annoying for shell tools.3.5
data_science/feature_eng/auto_averages.pydf.groupby('team_key').rolling(window=N)replaces ~80% of this code and runs orders of magnitude faster.from function_for_feature_eng import ...only works when CWD happens to be the right directory — should befrom data_science.feature_eng.function_for_feature_eng import ....[3, 5, 10, 'season']should live inconfig/data_science.py.test_opr.pyX_trainand then evaluates againsty_testpaired with predictions fromX_train— train/test leakage / mis-aligned shapes. Numbers reported from this script are not trustworthy.lightgbm/first_try_light_gbm.pyeval_setduring training (early-stopping on the test set ⇒ optimistic test metrics).categorical_feature=[...].eda/test_streamlit_server.py@st.cache_datawith no TTL — dashboard shows stale data after DB writes until the Python process restarts.dythonimport; large block of commented-out code.3.6
TestCode/and root scriptstest.py/test_plot.ipynbare ad-hoc scratch — not pytest tests. There is no automated test suite at all.confirm.pylives at repo root but is really a small utility; belongs in autils/package.requirements.txtdoes not pin versions — reproducibility risk.ystemctl list-units --type=service --state=running:qat the repo root looks like an accidental commit from a botched terminal command. Delete it.4. Cross-Cutting Concerns
flowchart LR subgraph Concerns C1[Configuration] C2[Error handling] C3[Logging] C4[Testing] C5[Transactions] C6[Reproducibility] end C1 -->|missing consts| BUG1[Import failures] C2 -->|absent| BUG2[Whole pipeline aborts on 1 bad response] C3 -->|print only| BUG3[Hard to diagnose batch runs] C4 -->|none| BUG4[Refactors are scary] C5 -->|if_exists=append| BUG5[Partial loads on crash] C6 -->|unpinned deps| BUG6[Works on my machine]START_YEAR,END_YEAR,SCHEMA; move all magic numbers in.env+ base64.env.example, document required varsrequestswith retry + status-code raiseprintonlylogging, one logger per modulepytest+ fixtures for JSON samples; start withmatch_data_from_jsonON CONFLICT DO NOTHINGteam_key1/2/3, year baked into names)year; rename position-keyed columns to colour/role5. Code Smells — Ranked
START_YEAR,END_YEAR,SCHEMA.git-walk to derive pathsystemctl ... :q6. Proposed Target Architecture
A relatively small rearrangement separates concerns the current layout already implies:
flowchart TD subgraph cfg["config/"] C["settings.py - pydantic-settings"] end subgraph ing["ingestion/ (was api/)"] CLI1["clients/frc.py"] CLI2["clients/tba.py"] BASE["clients/base.py - retry, auth, paging"] STORE["storage/json_cache.py"] end subgraph etl["etl/ (was sql/scripts/)"] PARSE["parsers/matches.py"] LOAD["loaders/postgres.py - upsert"] end subgraph db["db/"] MIG["migrations/"] REPO["repository/ - typed queries"] end subgraph feat["features/"] ROLL["rolling_stats.py - vectorized"] SQLFE["sql/*.sql"] end subgraph models["models/"] OPR2["opr.py"] GBM["lightgbm/"] TRAIN["train.py - CV, configs"] end subgraph apps["apps/"] DASH["dashboard/ - streamlit"] end subgraph tests["tests/"] UT["unit/"] IT["integration/"] FIX["fixtures/sample_*.json"] end C --> ing C --> etl C --> feat C --> models BASE --> CLI1 BASE --> CLI2 CLI1 --> STORE CLI2 --> STORE STORE --> PARSE PARSE --> LOAD LOAD --> REPO REPO --> feat feat --> models models --> apps tests -.-> ing tests -.-> etl tests -.-> feat tests -.-> modelsKey differences from today:
BaseAPIClientwith retry/auth/paging; FRC and TBA subclass it. Replaces allpull_and_save_*duplicates.repository/module wraps SQL queries used by feature/model code, so call sites don't write SQL strings.models/train.pyis configuration-driven (YAML or pydantic) so OPR and LightGBM share a single train/eval loop with proper CV.tests/mirrors source layout.7. Prioritized Roadmap
Tier 1 — Make it run correctly (hours)
START_YEAR,END_YEAR,SCHEMAto config/general.py.ystemctl ...file at repo root.requirements.txt(e.g.pip freeze > requirements.lock.txt).Tier 2 — Remove duplication & brittleness (days)
pull_and_save_*family into a single client with auth strategies.matches_data_2026_from_jsonby year; pick parser from a{year: parser}registry.INSERT ... ON CONFLICT DO NOTHINGand drop the in-Python set check.config/.Tier 3 — Robustness & observability (days)
requests-retry (e.g.urllib3.Retryadapter) with backoff and status-code checks.logging(one module-level logger) and replaceprint.pytestwith fixtures for one JSON sample per entity; cover parsers first.write_df_to_postgres(move it to the calling CLI).ttl=to Streamlit caches.Tier 4 — Modeling discipline (days)
groupby().rolling().eval_set.Tier 5 — Polish (days)
confirm.pyinto autils/package; add type hints; runruff+mypyin CI.README.md(link to this doc; includemake/uvcommands).api/first_api_bad_before_2020/outside the import tree.8. What's Already Good
dbmate-style naming.public) vs engineered (features) vs serving (predictions) schemas — exactly the right shape.The bones are right. The work is mostly tightening the joints rather than rebuilding.