
AISIS Data Scraper & Supabase Sync

This project contains a Node.js-based web scraper that automatically logs into AISIS, scrapes institutional data (class schedules and curriculum), and syncs the data to Supabase. The scraper is designed to run on a schedule using GitHub Actions.

Features

  • Automated Scraping: Runs on a scheduled basis via GitHub Actions.
  • Dynamic Department Discovery: Automatically discovers and scrapes all departments from the AISIS dropdown (IE, LCS, and any future departments are included without code changes).
  • Multi-Term Support: Can scrape current term, current + next term (new default), future terms, all available terms, or all terms in the current academic year in one run. See MULTI_TERM_SCRAPING.md.
  • Institutional Data Focus: Scrapes class schedules and official curriculum data.
  • Supabase Integration: Automatically syncs data to Supabase via Edge Functions.
  • Batched Sync Architecture: Two-layer batching prevents 504 timeouts when syncing thousands of records.
  • Secure Credential Management: Uses GitHub Secrets for secure storage of credentials.
  • Direct HTTP Scraping: Uses direct HTTP requests (node-fetch + Cheerio) instead of Puppeteer for speed, stability, and low memory usage.
  • Production-Grade: Built with error handling, robust data transformation, and partial failure recovery.
  • 🛡️ Data Loss Protection: Comprehensive sanity checks and per-department baseline tracking prevent destructive syncs when AISIS misbehaves (e.g., returns wrong courses). See Data Loss Protection below.

Note: As of the latest update, the default AISIS_SCRAPE_MODE changed from current to current_next. This means the scraper now fetches both the current term and the next term by default. To restore the previous single-term behavior, set AISIS_SCRAPE_MODE=current in your environment.

Data Categories Scraped

  1. Schedule of Classes: All available class schedules for all departments (runs every 6 hours). Supports multi-term scraping. ✅ Working
  2. Official Curriculum: ⚠️ EXPERIMENTAL - Curriculum scraping now supported via the J_VOFC.do endpoint. See Curriculum Scraping Status below for details.

Curriculum Scraping Status

Status: ⚠️ EXPERIMENTAL - Curriculum scraping is now functional with structured parsing (like schedules)

How It Works

The curriculum scraper uses the J_VOFC.do endpoint discovered through HAR file analysis and now includes structured parsing:

  1. GET J_VOFC.do - Retrieves a form with a dropdown containing all curriculum versions
  2. Parse <select name="degCode"> - Extracts curriculum version identifiers (e.g., BS CS_2024_1)
  3. POST J_VOFC.do with degCode=<value> - Fetches curriculum HTML for each version
  4. Parse HTML to structured rows - Extracts year/semester headers and course data into structured objects (NEW)
  5. Sync to Supabase and Google Sheets - Saves flat course rows with the same column layout as schedules (IMPROVED)
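
The dropdown-parsing step (step 2) can be sketched as follows. The real scraper parses with cheerio; a regex is used here so the example is self-contained, and the function name and sample HTML are illustrative only:

```javascript
// Sketch of step 2: extract curriculum version codes from the J_VOFC.do form.
function extractDegCodes(html) {
  const selectMatch = html.match(/<select[^>]*name="degCode"[^>]*>([\s\S]*?)<\/select>/i);
  if (!selectMatch) return [];
  const codes = [];
  const optionRe = /<option[^>]*value="([^"]+)"/gi;
  let m;
  while ((m = optionRe.exec(selectMatch[1])) !== null) {
    codes.push(m[1]);
  }
  return codes;
}

// Example with a trimmed-down form body:
const sampleHtml = `
  <form action="J_VOFC.do" method="post">
    <select name="degCode">
      <option value="BS CS_2024_1">BS Computer Science (2024)</option>
      <option value="BS ME_2023_1">BS Management Engineering (2023)</option>
    </select>
  </form>`;

console.log(extractDegCodes(sampleHtml));
// → ['BS CS_2024_1', 'BS ME_2023_1']
```

Each extracted code is then POSTed back as degCode=<value> in step 3.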

Structured Output

The curriculum scraper now produces row-based structured data similar to schedules, with each course as a separate row containing:

  • deg_code - Degree program code
  • program_label - Human-readable program name
  • year_level - 1-4
  • semester - 1-2
  • course_code - Course identifier
  • course_title - Course name
  • units - Numeric units
  • prerequisites - Prerequisites or null
  • category - Course category (M, C, etc.) or null

This enables direct use in Google Sheets with proper columns, matching the schedule scraping behavior.
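
A single parsed row using the fields listed above might look like this (the values are made up for demonstration; real rows come from the J_VOFC.do parse):

```javascript
// An illustrative curriculum row using the documented fields.
const sampleRow = {
  deg_code: 'BS CS_2024_1',
  program_label: 'BS Computer Science (2024)',
  year_level: 1,
  semester: 1,
  course_code: 'CSCI 21',
  course_title: 'Introduction to Programming I',
  units: 3,
  prerequisites: null,
  category: 'M',
};

// Because every row is flat, converting a batch to sheet rows is trivial:
const headers = Object.keys(sampleRow);
const toSheetRow = (row) => headers.map((h) => row[h] ?? '');

console.log(toSheetRow(sampleRow).length); // → 9
```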

Important Warnings

⚠️ This is an EXPERIMENTAL feature that depends on AISIS's HTML structure:

  • May break if AISIS changes the J_VOFC.do page layout
  • Not officially documented or supported by AISIS
  • Discovered through network traffic analysis (HAR file)
  • Should be treated as best-effort with monitoring

Previous Limitation (J_VOPC.do)

Earlier versions attempted to use the non-existent J_VOPC.do endpoint, which returned HTTP 404. The working alternative J_VOFC.do was discovered later through HAR analysis.

Alternative Solutions (Still Valid)

If J_VOFC.do becomes unreliable, consider:

  1. Scrape public curriculum pages: Extract from ateneo.edu/college/academics/degrees-majors
  2. Manual curriculum data: Maintain curated JSON from official PDFs
  3. Request API access: Contact AISIS administrators for official endpoint

For technical details, see docs/CURRICULUM_LIMITATION.md.

Getting Started

1. Set Up Supabase

You'll need a Supabase project with the appropriate Edge Functions deployed to receive scraped data.

  1. Create a Supabase project at https://supabase.com
  2. Deploy the Edge Functions from supabase/functions/:
    # Install Supabase CLI
    npm install -g supabase
    
    # Link to your project
    supabase link --project-ref YOUR_PROJECT_ID
    
    # Deploy the functions
    supabase functions deploy github-data-ingest
    supabase functions deploy aisis-scraper
    supabase functions deploy scrape-department
    supabase functions deploy import-schedules
  3. Set up the database schema (see supabase/functions/README.md)
  4. Generate an authentication token for the data ingest endpoint

2. Configure GitHub Secrets

In your GitHub repository, go to Settings > Secrets and variables > Actions and add the following secrets:

  • AISIS_USERNAME: Your AISIS username
  • AISIS_PASSWORD: Your AISIS password
  • SUPABASE_URL: Your Supabase project URL (e.g., https://your-project-id.supabase.co)
  • DATA_INGEST_TOKEN: The authentication token for your Supabase data ingest endpoint

3. Term Auto-Detection

The scraper now automatically detects the current academic term from AISIS without requiring manual code changes. It reads the term from the Schedule of Classes page dropdown.

To override the term (e.g., for scraping historical data or for CI/scheduled runs), you can set the AISIS_TERM environment variable:

AISIS_TERM=2025-1 npm start

Or add it to your .env file:

AISIS_TERM=2025-1

Legacy support: The APPLICABLE_PERIOD environment variable is still supported for backward compatibility, but AISIS_TERM takes precedence if both are set.

If no override is provided, the scraper will auto-detect and use the currently selected term in AISIS. Using an override skips the term auto-detection request, which can speed up startup time in CI environments.

3a. Dynamic Department Discovery

The scraper now automatically discovers departments from the AISIS Schedule of Classes page dropdown without requiring code changes.

How It Works

  • On startup, the scraper fetches the deptCode dropdown from AISIS and extracts all available department codes
  • New departments (like IE, LCS) are automatically included in scraping runs
  • If the AISIS fetch fails, the scraper falls back to a hardcoded list in src/constants.js
  • The AISIS_DEPARTMENTS environment variable can still be used to filter specific departments for testing
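
The discovery-with-fallback flow above can be sketched as follows. fetchDeptDropdownHtml and FALLBACK_DEPARTMENTS are stand-ins for the real fetch call and the hardcoded list in src/constants.js; the regex replaces the cheerio parse for a self-contained example:

```javascript
const FALLBACK_DEPARTMENTS = ['DISCS', 'MA', 'EN', 'EC']; // illustrative subset

function parseDeptCodes(html) {
  const select = html.match(/<select[^>]*name="deptCode"[^>]*>([\s\S]*?)<\/select>/i);
  if (!select) return [];
  return [...select[1].matchAll(/<option[^>]*value="([^"]+)"/gi)].map((m) => m[1]);
}

async function discoverDepartments(fetchDeptDropdownHtml) {
  try {
    const codes = parseDeptCodes(await fetchDeptDropdownHtml());
    if (codes.length > 0) return codes;
  } catch (err) {
    console.warn('Department discovery failed, using fallback:', err.message);
  }
  return FALLBACK_DEPARTMENTS; // safe fallback when the AISIS fetch fails
}

// New departments in the dropdown are picked up automatically:
discoverDepartments(async () => '<select name="deptCode"><option value="IE">IE</option></select>')
  .then((codes) => console.log(codes)); // → ['IE']
```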

Benefits

  • Future-proof: New departments are automatically discovered without code updates
  • Always current: Reflects the exact department list AISIS exposes for the current term
  • Safe fallback: Uses hardcoded list if AISIS fetch fails (network issues, page structure changes)
  • Developer-friendly: AISIS_DEPARTMENTS filter still works for local testing

Example Output

✅ Using 45 departments from AISIS dropdown (dynamic discovery)
🆕 New departments discovered: IE, LCS

4. Baseline Tracking and Regression Detection

The scraper includes automatic regression detection to alert when scraped record counts drop significantly between runs.

How It Works

  • After each scrape, the total record count and per-department counts are saved as a "baseline" in logs/baselines/baseline-{term}.json
  • On subsequent runs for the same term, the current count is compared with the previous baseline
  • If the count drops by more than a configurable threshold, a warning or error is triggered
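
The comparison logic can be sketched as a small pure function (names are illustrative; the real implementation lives in the scraper's baseline module):

```javascript
// Compare the current total against the previous baseline and flag a
// regression when the drop exceeds the configured threshold percentage.
function checkRegression(previousTotal, currentTotal, thresholdPct = 5.0) {
  const change = currentTotal - previousTotal;
  const pct = previousTotal > 0 ? (change / previousTotal) * 100 : 0;
  return {
    change,
    pct: Number(pct.toFixed(2)),
    regression: pct < -thresholdPct,
  };
}

// 4000 → 3520 is a 12% drop, which exceeds the default 5% threshold:
console.log(checkRegression(4000, 3520));
// → { change: -480, pct: -12, regression: true }
```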

Configuration

# Threshold percentage for triggering regression alert (default: 5.0)
BASELINE_DROP_THRESHOLD=5.0 npm start

# Warn-only mode: log warning but don't fail job (default: true)
# Set to false to fail the job when regression is detected
BASELINE_WARN_ONLY=true npm start

Example Output

📊 Baseline Comparison:
   Term: 2025-1
   Previous run: 2025-01-15T10:30:00.000Z
   Previous total: 4000 records
   Current total: 3520 records
   Change: -480 records (-12.00%)
   ⚠️ WARNING: Record count dropped by 480 records (12.00%)
   This exceeds the configured threshold of 5.0%

The baseline files are stored locally in logs/baselines/ and are not committed to git (already in .gitignore). In GitHub Actions, these files are ephemeral unless you configure artifact upload.

5. Performance Tuning

The scraper includes several performance optimization options:

Supabase Sync Batch Size and Concurrency

The sync phase has been optimized with batching and HTTP-level concurrency to reduce total sync time:

# Schedule sync performance
SUPABASE_CLIENT_BATCH_SIZE=2000 npm start  # Default: 2000 records per batch
SCHEDULE_SEND_CONCURRENCY=2 npm start      # Default: 2 concurrent HTTP requests

# Curriculum sync performance
CURRICULUM_SEND_GROUP_SIZE=10 npm run curriculum      # Default: 10 programs per batch
CURRICULUM_SEND_CONCURRENCY=2 npm run curriculum      # Default: 2 concurrent HTTP requests

Schedule Sync Optimization:

  • SUPABASE_CLIENT_BATCH_SIZE: Controls batch size (default: 2000 records)
    • Larger values (e.g., 3000-5000): Fewer HTTP requests, faster sync
    • Smaller values (e.g., 500-1000): More granular progress, safer for timeouts
  • SCHEDULE_SEND_CONCURRENCY: Controls parallel HTTP requests (default: 2)
    • Higher values (3-5): Faster sync but more aggressive
    • Lower values (1-2): Conservative, safer for Edge Function limits

Curriculum Sync Optimization:

  • CURRICULUM_SEND_GROUP_SIZE: Number of programs grouped per HTTP request (default: 10)
    • Reduces HTTP overhead by sending multiple programs in one call
    • Edge function internally batches DB operations (500 records per transaction)
  • CURRICULUM_SEND_CONCURRENCY: Parallel HTTP requests (default: 2)
    • Higher values (3-5): Faster sync for large curriculum sets
    • Lower values (1-2): Conservative, reduces Edge Function load

Performance Impact:

  • Schedules: For ~4000 courses, sync time reduced from ~15 minutes (43 sequential department sends) to ~3-5 minutes (2-3 batches with concurrency 2)
  • Curriculum: For ~450 programs, sync time reduced from ~15 minutes (450 sequential sends) to ~2-4 minutes (~45 grouped sends with concurrency 2)

The Edge Function further splits large batches into smaller database transactions (schedules: 100 by default via GITHUB_INGEST_DB_BATCH_SIZE, curriculum: 500) to prevent individual transaction timeouts.

Google Sheets Integration

The scraper includes optional Google Sheets integration for easy data visualization and sharing. When enabled, scraped data is automatically synced to a Google Spreadsheet alongside Supabase.

How It Works

The GoogleSheetsManager class (in src/sheets.js) uses the Google Sheets API v4 to:

  1. Clear existing data from the specified sheet tab
  2. Write headers from the first data object's keys
  3. Write data rows with automatic type conversion (objects/arrays → JSON strings)
  4. Auto-format using Google Sheets' USER_ENTERED mode for numbers, dates, etc.
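
Steps 2-3 can be sketched as a pure conversion: headers come from the first object's keys, and nested values are JSON-stringified. This mirrors the described behavior of GoogleSheetsManager; the helper name is illustrative:

```javascript
// Convert an array of flat record objects into a 2D values array
// suitable for the Sheets API (header row first).
function toSheetValues(records) {
  if (records.length === 0) return [];
  const headers = Object.keys(records[0]);
  const rows = records.map((rec) =>
    headers.map((h) => {
      const v = rec[h];
      if (v === null || v === undefined) return '';
      return typeof v === 'object' ? JSON.stringify(v) : v; // objects/arrays → JSON strings
    })
  );
  return [headers, ...rows];
}

const values = toSheetValues([
  { subject_code: 'CSCI 30', section: 'A', schedule: { day: 'MWF', time: '0800-0900' } },
]);
console.log(values[0]); // → ['subject_code', 'section', 'schedule']
console.log(values[1][2]); // → '{"day":"MWF","time":"0800-0900"}'
```

The resulting array is what gets written with the USER_ENTERED value input option so Sheets auto-formats numbers and dates.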

Setup

  1. Create a Google Cloud Service Account:

    • Go to Google Cloud Console
    • Create a new project or select an existing one
    • Enable the Google Sheets API
    • Create a Service Account and download the JSON credentials
  2. Share your Google Spreadsheet:

    • Create a new Google Sheet or use an existing one
    • Share it with the service account email (e.g., your-service@project.iam.gserviceaccount.com)
    • Grant "Editor" permissions
  3. Configure Environment Variables:

    # Base64-encode your service account JSON file
    GOOGLE_SERVICE_ACCOUNT=$(cat service-account.json | base64 -w 0)
    
    # Get your spreadsheet ID from the URL
    # https://docs.google.com/spreadsheets/d/SPREADSHEET_ID/edit
    SPREADSHEET_ID=your_spreadsheet_id_here
  4. Create Sheet Tabs:

    • For schedules: Create a tab named Schedules
    • For curriculum: Create a tab named Curriculum

Expected Sheet Names

The scraper syncs data to specific sheet tabs:

  • Schedules - Class schedule data with columns: department, term_code, subject_code, section, title, units, time, room, instructor, etc.
  • Curriculum - Curriculum data with columns: deg_code, program_label, year_level, semester, course_code, course_title, units, prerequisites, category

Data Format

Both schedule and curriculum data are synced as flat rows with proper column headers:

  • First row contains field names (auto-detected from data)
  • Subsequent rows contain course/curriculum entries
  • Complex objects are JSON-stringified for compatibility
  • Numbers and dates are auto-formatted by Google Sheets

Usage Example

// Initialize (service account credentials as Base64)
const sheets = new GoogleSheetsManager(process.env.GOOGLE_SERVICE_ACCOUNT);

// Sync schedule data
await sheets.syncData(spreadsheetId, 'Schedules', scheduleRecords);

// Sync curriculum data  
await sheets.syncData(spreadsheetId, 'Curriculum', curriculumRecords);

Troubleshooting

Error: "Unable to parse range"

  • Make sure a sheet tab with the exact name exists (case-sensitive)
  • Create tabs named Schedules and Curriculum if they don't exist

Error: "Permission denied"

  • Verify the spreadsheet is shared with your service account email
  • Grant "Editor" permissions (not just "Viewer")

Error: "API not enabled"

  • Enable the Google Sheets API in your Google Cloud project
  • Wait a few minutes for the API to become active

Performance Configuration

The scraper provides extensive configuration options for optimizing performance based on your use case (local development, CI, production).

Schedule Scraper Performance Options

Fast Mode (FAST_MODE)

Enable aggressive optimizations for faster local development:

FAST_MODE=true npm start

When enabled:

  • Skips term auto-detection if AISIS_TERM is provided
  • Skips the single test-department validation pass
  • Uses minimal batch delays (0ms by default)
  • Processes all departments immediately in concurrent batches

Use for: Local development, manual testing, rapid iteration
Avoid for: Production CI (may be too aggressive for the AISIS server)

Concurrency Control (AISIS_CONCURRENCY)

Control how many departments are scraped in parallel:

AISIS_CONCURRENCY=12 npm start  # Default: 8
  • Lower values (1-5): More polite to AISIS, safer for stability
  • Default (8): Balanced performance and stability
  • Higher values (10-20): Faster scraping, more aggressive (use with caution)

Batch Delay (AISIS_BATCH_DELAY_MS)

Control the delay between batches of departments:

AISIS_BATCH_DELAY_MS=0 npm start  # Default: 500ms
  • 0ms: No delay, maximum speed (use with FAST_MODE)
  • 500ms (default): Polite delay for production
  • 1000ms+: Very conservative, safest for AISIS stability

Department Filtering (AISIS_DEPARTMENTS)

Scrape only specific departments (useful for local testing):

AISIS_DEPARTMENTS="DISCS,MA,EN,EC" npm start
  • Accepts comma-separated list of department codes
  • Validates against the dynamically discovered department list (or fallback list if AISIS fetch failed)
  • Invalid codes are warned and ignored
  • Useful for testing changes without scraping all departments

Example local development run:

FAST_MODE=true \
AISIS_TERM=2025-1 \
AISIS_DEPARTMENTS="DISCS,MA" \
AISIS_CONCURRENCY=2 \
AISIS_BATCH_DELAY_MS=0 \
npm start

Curriculum Scraper Performance Options

Curriculum Limiting (CURRICULUM_LIMIT)

Scrape only the first N curriculum programs:

CURRICULUM_LIMIT=10 npm run curriculum
  • Useful for local development and testing
  • Takes the first N programs from AISIS dropdown
  • Default: scrape all programs (typically several hundred)

Curriculum Sampling (CURRICULUM_SAMPLE)

Scrape specific curriculum programs by degree code:

CURRICULUM_SAMPLE="BS CS_2024_1,BS ME_2023_1,BS ECE_2024_1" npm run curriculum
  • Comma-separated list of exact degCode values
  • Takes precedence over CURRICULUM_LIMIT
  • Warns if requested codes are not found in AISIS
  • Useful for testing specific programs or incremental updates

Curriculum Delay (CURRICULUM_DELAY_MS)

Control delay between curriculum requests:

CURRICULUM_DELAY_MS=0 npm run curriculum  # Default: 1000ms (balanced mode), 500ms (fast mode)
  • 0ms: No delay, maximum speed (use for local dev, higher risk)
  • 500ms: Fast mode default - good balance of speed and safety
  • 1000ms (default): Balanced mode - optimized for reliability
  • 2000ms+: Ultra-conservative (opt-in for maximum safety)

Balanced defaults: The 1000ms default provides reliable scraping while maintaining reasonable performance (~10-15 minutes for all curricula).

Curriculum Concurrency (CURRICULUM_CONCURRENCY)

Scrape multiple curriculum programs in parallel:

CURRICULUM_CONCURRENCY=3 npm run curriculum  # Default: 2 (balanced parallelism)
  • 1: Sequential scraping (ultra-safe mode, opt-in for maximum safety)
  • 2 (default): Balanced parallelism - reliable and prevents session bleed
  • 3-4: Higher parallelism - faster, increased risk of session bleed
  • 5-10: Maximum parallelism - fastest, highest risk of session bleed

Balanced defaults: The default of 2 provides parallel scraping while minimizing AISIS session bleed issues that can occur at higher concurrency levels.

Performance Improvements (v3.3+)

🚀 Curriculum scraping uses balanced defaults for reliability!

The curriculum scraper uses balanced default settings that prioritize reliability while maintaining reasonable performance:

  • Delay: 1000ms (balanced mode) or 500ms (fast mode) - prevents AISIS session bleed
  • Concurrency: 2 programs in parallel - uses _scrapeDegreeWithValidation to prevent session bleed
  • Safety maintained: All requests validated via _scrapeDegreeWithValidation, AISIS_ERROR_PAGE detection, and retry logic

Expected performance (1000ms delay, concurrency 2):

  • 459 programs ÷ 2 = 230 parallel batches
  • 230 × (1000ms delay + ~2s request) = ~690 seconds (~11.5 minutes) in delays
  • With network overhead and retries: ~10-15 minutes for 459 programs (well under 20-30 minute threshold)

For faster scraping (use FAST_MODE for 500ms delays):

  • 459 programs ÷ 2 = 230 parallel batches
  • 230 × (500ms delay + ~2s request) = ~575 seconds (~9.6 minutes) in delays
  • With network overhead and retries: ~6-10 minutes for 459 programs

Speed vs. Reliability Tradeoff:

  • Higher concurrency (>2): Faster but increased risk of AISIS session bleed
  • Lower delay (<500ms): Faster but may trigger rate limiting or session bleed
  • These defaults (2 concurrency, 1000ms delay) balance speed with reliability
  • Session bleed issues observed at concurrency=6, delay=300ms have been eliminated

These defaults have been tested and include robust validation to prevent AISIS session bleed. You can still opt for faster settings at your own risk:

CURRICULUM_DELAY_MS=300 CURRICULUM_CONCURRENCY=6 npm run curriculum

Example fast curriculum scraping:

FAST_MODE=true \
CURRICULUM_LIMIT=20 \
CURRICULUM_DELAY_MS=300 \
CURRICULUM_CONCURRENCY=4 \
npm run curriculum

Recommended Configurations

Local Development (Fast Iteration)

# .env for local development
FAST_MODE=true
AISIS_TERM=2025-1
AISIS_DEPARTMENTS=DISCS,MA
AISIS_CONCURRENCY=4
AISIS_BATCH_DELAY_MS=0
CURRICULUM_LIMIT=5
CURRICULUM_DELAY_MS=500
CURRICULUM_CONCURRENCY=2

GitHub Actions CI (Stable, Production)

# Use defaults for maximum stability (balanced performance + safety)
env:
  AISIS_TERM: '2025-1'  # Skip auto-detection for speed
  # All other settings use balanced defaults
  # AISIS_CONCURRENCY: 8 (default)
  # AISIS_BATCH_DELAY_MS: 500 (default)
  # CURRICULUM_DELAY_MS: 1000 (default - balanced mode)
  # CURRICULUM_CONCURRENCY: 2 (default - balanced parallel with validation)

Manual Full Scrape (Balance Speed & Safety)

AISIS_TERM=2025-1 \
AISIS_CONCURRENCY=10 \
AISIS_BATCH_DELAY_MS=250 \
CURRICULUM_CONCURRENCY=4 \
npm start && npm run curriculum

Performance Monitoring

The scraper logs detailed timing information for each phase:

  • Initialization: Cookie loading, session setup
  • Login & validation: AISIS authentication
  • Term detection: Auto-detect term (skipped if AISIS_TERM set)
  • Department discovery: Fetch available departments from AISIS
  • Test department: Single department validation (skipped in FAST_MODE)
  • Batch processing: Per-batch timing and progress
  • Supabase sync: Database upload timing
  • Sheets sync: Google Sheets upload timing

Example output:

⏱  Performance Summary:
   Initialization: 0.3s
   Login & validation: 2.1s
   AISIS scraping: 45.2s
   Supabase sync: 12.4s
   Sheets sync: 8.7s
   Total time: 68.7s

How It Works

  • GitHub Actions: The project has four workflows:
    1. AISIS – Class Schedule (Current + Next Term) (.github/workflows/scrape-institutional-data.yml): Runs every 6 hours to scrape class schedules for both the current term and the next term. This is the primary operational workflow that keeps schedule data fresh. Each term is synced separately to the github-data-ingest function with replace_existing: true to safely replace existing data without cross-term issues.
    2. AISIS – Class Schedule (Full Academic Year) (.github/workflows/aisis-schedule-full-year.yml): Manual trigger to scrape all three semesters (intersession, first, second) for a specified academic year
    3. AISIS – Class Schedule (All Available Terms) (.github/workflows/scrape-future-terms.yml): Runs weekly to scrape all terms in the current academic year
    4. AISIS – Degree Curricula (All Programs) (.github/workflows/scrape-curriculum.yml): Runs weekly to scrape official curriculum data
  • Scraper (src/scraper.js): This script uses node-fetch to perform direct HTTP requests and cheerio to parse the HTML, eliminating the need for a headless browser (Puppeteer). This makes the scraper significantly faster and more stable.
  • Supabase Sync (src/supabase.js): This script transforms the scraped data and syncs it to Supabase via the github-data-ingest Edge Function endpoint.
  • Main Scripts:
    • src/index.js: Entry point for scraping class schedules (current term, current+next, or multi-term modes)
    • src/scrape-full-year.js: Entry point for full academic year schedule scraping
    • src/index-curriculum.js: Entry point for scraping curriculum data
    • src/term-utils.js: Helpers for term code calculations (next term, etc.)
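
A hedged sketch of the next-term calculation in src/term-utils.js: it assumes term codes are YYYY-S with S in {0, 1, 2} (intersession, first, second semester), an assumption inferred from the full-year workflow above rather than confirmed against the actual implementation:

```javascript
// Advance a term code by one term, rolling over to the next
// academic year's intersession after the second semester.
// (Term encoding is an assumption; see lead-in.)
function nextTerm(termCode) {
  const [year, sem] = termCode.split('-').map(Number);
  return sem >= 2 ? `${year + 1}-0` : `${year}-${sem + 1}`;
}

console.log(nextTerm('2025-1')); // → '2025-2'
console.log(nextTerm('2025-2')); // → '2026-0'
```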

Running Locally (for Testing)

  1. Clone the repository:

    git clone https://github.com/CelestialBrain/aisis-scraper.git
    cd aisis-scraper
  2. Install dependencies:

    npm install
  3. Create a .env file: Copy the .env.example file to .env and fill in your credentials.

    cp .env.example .env

    Your .env file should contain:

    AISIS_USERNAME=your_username
    AISIS_PASSWORD=your_password
    SUPABASE_URL=https://your-project-id.supabase.co
    DATA_INGEST_TOKEN=your_ingest_token
    
    # Optional: Override the term for manual scraping (skips auto-detection)
    # AISIS_TERM=2025-1
    
    # Optional: Performance tuning for Supabase sync
    # SUPABASE_CLIENT_BATCH_SIZE=2000
    
  4. Run the scraper:

    For class schedules (production-ready):

    npm start

    For curriculum data (experimental - see status above):

    npm run curriculum  # May return curriculum data or empty array

    For testing the curriculum endpoint:

    node test-curriculum-endpoint.js

Architecture

This is a fast and stable scraper (v3) that:

  • Uses direct HTTP requests for reliability and speed
  • Scrapes institutional data (class schedules and experimental curriculum support)
  • Syncs directly to Supabase via Edge Functions
  • Includes robust error handling and data transformation

Batching Architecture (v3.1)

To handle large datasets (3000+ schedule records) without timeouts, the system uses two-layer batching with configurable batch sizes for optimal performance.

Layer 1: Client-Side Batching (src/supabase.js)

  • Splits large datasets into configurable chunks (default: 2000 records)
  • Sends multiple HTTP requests to the Edge Function
  • Prevents overwhelming the Edge Function with giant payloads
  • Tracks partial failures across batches
  • Configurable via SUPABASE_CLIENT_BATCH_SIZE environment variable
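
Layer 1 can be sketched as follows. The chunk size mirrors SUPABASE_CLIENT_BATCH_SIZE, and sendBatch stands in for the real HTTP call to the Edge Function; names are illustrative, not the project's actual code:

```javascript
// Split records into client-side batches of the configured size.
function chunk(records, size = 2000) {
  const batches = [];
  for (let i = 0; i < records.length; i += size) {
    batches.push(records.slice(i, i + size));
  }
  return batches;
}

// Send each batch, collecting partial failures instead of aborting.
async function syncInBatches(records, sendBatch, size = 2000) {
  const failures = [];
  for (const [i, batch] of chunk(records, size).entries()) {
    try {
      await sendBatch(batch);
    } catch (err) {
      failures.push({ batch: i, count: batch.length, error: err.message });
    }
  }
  return failures; // partial failures are reported, not fatal
}

// 3783 records at the default batch size → 2 HTTP requests:
console.log(chunk(Array(3783).fill(0)).map((b) => b.length)); // → [2000, 1783]
```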

Layer 2: Server-Side Batching (Edge Functions)

  • Further splits each request into 100-record database transactions (default)
  • Uses upsert with correct onConflict key: term_code,subject_code,section,department
  • Partial failure handling - one failed batch doesn't block others
  • Detailed logging for debugging
  • Configurable via GITHUB_INGEST_DB_BATCH_SIZE environment variable (range: 50-500)

Example: Syncing 3783 schedules (optimized)

Client sends: 2 requests × ~2000 records each
  ↓
Each request: ~20 database batches × 100 records each
  ↓
Total: ~40 database transactions of 100 records
  ↓
Result: No timeouts, faster sync (~5-8 minutes vs 14-15 minutes)

Previous architecture (v3.0): 8 requests × 500 records

Client sends: 8 requests × ~500 records each  
  ↓
Each request: 5 database batches × 100 records each
  ↓
Total: 40 database transactions of 100 records
  ↓
Result: Slower due to HTTP overhead (14-15 minutes)

This architecture ensures:

  • ✅ No 504 Gateway Timeout errors
  • ✅ Graceful handling of partial failures
  • ✅ Idempotent upserts (safe to re-run)
  • ✅ Detailed error logging

For more details, see supabase/functions/README.md.

Security Considerations

  • All credentials are stored securely in GitHub Secrets or local .env files
  • The Supabase sync endpoint should be protected with API key authentication
  • Never commit your .env file to version control

Data Validation and Testing

Parser Validation

The scraper includes comprehensive tests to ensure all course patterns are correctly parsed:

npm test         # Run basic parser tests
npm run test:all # Run all parser tests including PE subject parsing
node tests/test-real-world-patterns.js  # Test with real AISIS patterns

The tests validate:

  • Decimal course codes: ENE 13.03i, ENGL 298.66, PEPC 13.03
  • Complex section codes: WXY1, ST1A, PT-GRAD, THES/DISS1-8
  • 0-unit enrollment objects: COMP, SUB-A, SUB-B, THES/DISS, YYY, ODEF, RESID
  • Special markers: TBA (~) for special enrollment courses
  • Graduate courses: 200-300 level courses
  • Lab sections: LAB1-VW, LAB2-VW
  • PE department subjects: PEPC 10, NSTP 11/CWTS, PHYED 100.20

Subject Validation

After scraping, validate subject distribution across departments:

npm run validate:subjects  # Analyze data/courses.json
node src/validate-subjects.js data/custom-courses.json  # Custom path

This script:

  • Computes per-department course counts
  • Computes per-subject prefix breakdown (e.g., PEPC, NSTP, PHYED within PE department)
  • Identifies missing subject families in critical departments

Example output:

📊 Per-Department Summary:
PE     ( 79 courses): NSTP=79, PEPC=0, PHYED=0

🔍 Critical Department Analysis:
⚠️  PE: PEPC courses missing (count = 0)

Baseline Tracking

Baseline files are stored in logs/baselines/baseline-{term}.json and track:

  • Total record count per term
  • Per-department record counts
  • Timestamp of scrape
  • GitHub Actions metadata (if running in CI)
  • Optional: Per-department subject prefix counts (when TRACK_SUBJECT_PREFIXES=true)

Important: Baseline files are local and not committed to git. To preserve baselines across GitHub Actions runs:

  1. Upload baselines as artifacts:

    - name: Upload baselines
      uses: actions/upload-artifact@v3
      with:
        name: baselines
        path: logs/baselines/
  2. Download baselines before running scraper:

    - name: Download baselines
      uses: actions/download-artifact@v3
      with:
        name: baselines
        path: logs/baselines/
      continue-on-error: true  # Don't fail if no previous baseline

Environment Variables Summary

Variable Default Description
Authentication
AISIS_USERNAME - Required: AISIS login username
AISIS_PASSWORD - Required: AISIS login password
Data Sync
DATA_INGEST_TOKEN - Supabase ingest endpoint token
SUPABASE_URL - Supabase project URL
GOOGLE_SERVICE_ACCOUNT - Base64-encoded service account JSON
SPREADSHEET_ID - Google Sheets spreadsheet ID
SUPABASE_CLIENT_BATCH_SIZE 2000 Records per HTTP request to Supabase
CURRICULUM_SEND_GROUP_SIZE 10 Programs grouped per HTTP request (1-50)
CURRICULUM_SEND_CONCURRENCY 2 Concurrent curriculum group sends (1-5)
Term Configuration
AISIS_TERM Auto-detect Override term code (e.g., 2025-1)
APPLICABLE_PERIOD Auto-detect Legacy term override (use AISIS_TERM instead)
AISIS_SCRAPE_MODE current_next Scrape mode: current, current_next, future, all, or year. See MULTI_TERM_SCRAPING.md
Schedule Scraper Performance
FAST_MODE false Enable fast mode (skip validation, minimal delays)
AISIS_CONCURRENCY 8 Departments to scrape in parallel (1-20)
AISIS_BATCH_DELAY_MS 500 Delay between department batches (0-5000ms)
AISIS_DEPARTMENTS All Comma-separated list of departments to scrape
Curriculum Scraper Performance
CURRICULUM_LIMIT All Limit to first N curriculum programs
CURRICULUM_SAMPLE All Comma-separated list of specific degree codes
CURRICULUM_DELAY_MS 1000 Delay between curriculum requests (0-5000ms) - Balanced default
CURRICULUM_CONCURRENCY 2 Programs to scrape in parallel (1-10) - Balanced default
Regression Detection
BASELINE_DROP_THRESHOLD 5.0 Overall regression alert threshold (%)
BASELINE_DEPT_DROP_THRESHOLD 0.5 Per-department regression threshold (0.0-1.0 = 0%-100% drop)
BASELINE_WARN_ONLY true Warn only (don't fail job) on regression
REQUIRE_BASELINES true Fail job if baselines artifact is missing (prevents data loss). See docs/ingestion.md
TRACK_SUBJECT_PREFIXES false Track per-department subject prefix counts in baselines for regression detection
Department Sanity Checks
SCRAPER_MIN_MA_MATH 50 Minimum MATH courses required for MA (Mathematics) department
SCRAPER_MIN_PE_COURSES 20 Minimum total courses required for PE department
SCRAPER_MIN_NSTP_COURSES 10 Minimum NSTP courses required for NSTP departments
Debugging
DEBUG_SCRAPER false Enable detailed debug logging including subject prefix breakdowns

Data Loss Protection

The scraper includes comprehensive safeguards to prevent data loss from AISIS misrouting or HTML quirks. These protections were implemented after a critical incident where the MA (Mathematics) department returned only 13 Korean-language courses instead of 300+ MATH courses, and the scraper used replace_existing=true to wipe out all correct data.

Department-Level Sanity Checks

The scraper performs automatic sanity checks on critical departments during scraping:

MA (Mathematics) Department:

  • Counts courses with MATH subject prefix
  • Requires minimum of 50 MATH courses (configurable via SCRAPER_MIN_MA_MATH)
  • If count is 0 or below threshold, scrape fails and department is marked as failed
  • Raw HTML is saved to logs/ for debugging

PE (Physical Education) Department:

  • Requires presence of PEPC and/or PHYED courses
  • Enforces minimum total course count (default: 20, configurable via SCRAPER_MIN_PE_COURSES)
  • Detects when required subject prefixes are missing

NSTP Departments:

  • Requires minimum NSTP-prefixed courses (default: 10, configurable via SCRAPER_MIN_NSTP_COURSES)
  • Applies to both NSTP (ADAST) and NSTP (OSCI) departments

When a sanity check fails:

  • The department scrape throws an error
  • The department is marked as failed in scrape summary
  • Department's courses are excluded from Supabase sync
  • Raw HTML response is saved to logs/raw-sanity-check-failed-{term}-{dept}.html
  • Clear error messages are logged for debugging
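
The MA check above can be sketched as a prefix count plus a threshold guard. The threshold mirrors the documented SCRAPER_MIN_MA_MATH behavior; the functions themselves are illustrative, not the project's actual code:

```javascript
// Count courses whose subject code starts with a given prefix
// (e.g. 'MATH 21' matches prefix 'MATH').
function countByPrefix(courses, prefix) {
  return courses.filter((c) => (c.subject_code || '').startsWith(prefix + ' ')).length;
}

// Throw if the MA department returned too few MATH courses,
// marking the department as failed instead of syncing bad data.
function assertMaSanity(courses, minMath = 50) {
  const mathCount = countByPrefix(courses, 'MATH');
  if (mathCount < minMath) {
    throw new Error(
      `MA sanity check failed: ${mathCount} MATH courses found, minimum is ${minMath}`
    );
  }
  return mathCount;
}
```

A department that returns, say, 13 misrouted courses would fail this check and be excluded from the sync.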

Per-Department Baseline Tracking

The baseline system now tracks per-department statistics in addition to overall counts:

  • Stores baseline file: logs/baselines/baseline-{term}-departments.json
  • Tracks for each department:
    • row_count: Number of courses
    • prefix_breakdown: Count of courses by subject prefix (e.g., MATH=305, PEPC=79)
  • Detects regressions on a per-department basis before syncing to Supabase
  • Configurable drop threshold (default: 50% via BASELINE_DEPT_DROP_THRESHOLD)

Critical departments (MA, PE, NSTP) that fail regression checks will block replace_existing=true behavior, preventing destructive syncs.

Supabase Sync Hardening

Before syncing schedule data with replace_existing=true:

  1. Pre-sync health check validates all department data
  2. Compares current per-department counts against baselines
  3. Detects critical regressions (e.g., MA dropping from 305 to 13 courses)
  4. If critical departments fail health check:
    • Sync is aborted to prevent data loss
    • Clear error message explains which departments failed
    • replace_existing=true is never sent to Supabase
    • Existing good data in database is preserved
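
The health check's regression comparison can be sketched as follows. The threshold mirrors BASELINE_DEPT_DROP_THRESHOLD (0.5 = 50% drop); the function name is illustrative:

```javascript
// Compare per-department counts against the baseline and return the
// departments whose drop exceeds the threshold fraction.
function failingDepartments(baseline, current, dropThreshold = 0.5) {
  return Object.entries(baseline)
    .filter(([dept, prevCount]) => {
      const nowCount = current[dept] ?? 0;
      return prevCount > 0 && (prevCount - nowCount) / prevCount > dropThreshold;
    })
    .map(([dept]) => dept);
}

// MA dropping from 305 to 13 is a ~96% drop, so the sync would be aborted:
console.log(failingDepartments({ MA: 305, PE: 79 }, { MA: 13, PE: 80 })); // → ['MA']
```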

Raw HTML Snapshotting

When sanity checks or health checks fail, the scraper automatically:

  • Creates logs/ directory if it doesn't exist
  • Saves raw HTML response to timestamped file
  • Logs file path for manual inspection and debugging
  • Filename format: logs/raw-{reason}-{term}-{dept}-{timestamp}.html

This allows maintainers to inspect exactly what AISIS returned and diagnose the root cause.

Configuration

All sanity check thresholds are configurable via environment variables:

# MA (Mathematics) department
SCRAPER_MIN_MA_MATH=50          # Minimum MATH courses

# PE (Physical Education) department
SCRAPER_MIN_PE_COURSES=20       # Minimum total courses

# NSTP departments
SCRAPER_MIN_NSTP_COURSES=10     # Minimum NSTP courses

# Per-department regression threshold
BASELINE_DEPT_DROP_THRESHOLD=0.5  # 50% drop triggers regression

# Enable verbose logging
DEBUG_SCRAPER=true

Troubleshooting

If you see sanity check failures:

  1. Check logs for error messages indicating which department failed and why
  2. Inspect raw HTML saved to logs/ directory
  3. Verify AISIS is returning correct data via web browser
  4. Adjust thresholds if legitimate changes occurred (new semester, course restructuring)
  5. Re-run scraper once AISIS issue is resolved

For more details on the baseline system and regression detection, see existing documentation on BASELINE_DROP_THRESHOLD and BASELINE_WARN_ONLY.

License

This project is licensed under the MIT License.
