
AISIS Data Scraper & Supabase Sync

This project contains a Node.js-based web scraper that automatically logs into AISIS, scrapes institutional data (class schedules and curriculum), and syncs the data to Supabase. The scraper is designed to run on a schedule using GitHub Actions.

Features

  • Automated Scraping: Runs on a scheduled basis via GitHub Actions.
  • Dynamic Department Discovery: Automatically discovers and scrapes all departments from the AISIS dropdown (IE, LCS, and any future departments are included without code changes).
  • Multi-Term Support: Can scrape current term, current + next term (new default), future terms, all available terms, or all terms in the current academic year in one run. See MULTI_TERM_SCRAPING.md.
  • Institutional Data Focus: Scrapes class schedules and official curriculum data.
  • Supabase Integration: Automatically syncs data to Supabase via Edge Functions.
  • Batched Sync Architecture: Two-layer batching prevents 504 timeouts when syncing thousands of records.
  • Secure Credential Management: Uses GitHub Secrets for secure storage of credentials.
  • Direct HTTP Scraping: Uses direct HTTP requests (node-fetch + Cheerio) instead of Puppeteer for speed, stability, and low memory usage.
  • Production-Grade: Built with error handling, robust data transformation, and partial failure recovery.
  • 🛡️ Data Loss Protection: Comprehensive sanity checks and per-department baseline tracking prevent destructive syncs when AISIS misbehaves (e.g., returns wrong courses). See Data Loss Protection below.

Note: As of the latest update, the default AISIS_SCRAPE_MODE changed from current to current_next. This means the scraper now fetches both the current term and the next term by default. To restore the previous single-term behavior, set AISIS_SCRAPE_MODE=current in your environment.

Data Categories Scraped

  1. Schedule of Classes: All available class schedules for all departments (runs every 6 hours). Supports multi-term scraping. ✅ Working
  2. Official Curriculum: ⚠️ EXPERIMENTAL - Curriculum scraping now supported via the J_VOFC.do endpoint. See Curriculum Scraping Status below for details.

Curriculum Scraping Status

Status: ⚠️ EXPERIMENTAL - Curriculum scraping is now functional with structured parsing (like schedules)

How It Works

The curriculum scraper uses the J_VOFC.do endpoint discovered through HAR file analysis and now includes structured parsing:

  1. GET J_VOFC.do - Retrieves a form with a dropdown containing all curriculum versions
  2. Parse <select name="degCode"> - Extracts curriculum version identifiers (e.g., BS CS_2024_1)
  3. POST J_VOFC.do with degCode=<value> - Fetches curriculum HTML for each version
  4. Parse HTML to structured rows - Extracts year/semester headers and course data into structured objects (NEW)
  5. Sync to Supabase and Google Sheets - Saves flat course rows with the same column layout as schedules (IMPROVED)
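
The dropdown-parsing step (step 2) can be sketched as follows. The real scraper parses with cheerio; a regex is used here so the example is self-contained, and the function name and sample HTML are illustrative only:

```javascript
// Sketch of step 2: extract curriculum version codes from the J_VOFC.do form.
function extractDegCodes(html) {
  const selectMatch = html.match(/<select[^>]*name="degCode"[^>]*>([\s\S]*?)<\/select>/i);
  if (!selectMatch) return [];
  const codes = [];
  const optionRe = /<option[^>]*value="([^"]+)"/gi;
  let m;
  while ((m = optionRe.exec(selectMatch[1])) !== null) {
    codes.push(m[1]);
  }
  return codes;
}

// Example with a trimmed-down form body:
const sampleHtml = `
  <form action="J_VOFC.do" method="post">
    <select name="degCode">
      <option value="BS CS_2024_1">BS Computer Science (2024)</option>
      <option value="BS ME_2023_1">BS Management Engineering (2023)</option>
    </select>
  </form>`;

console.log(extractDegCodes(sampleHtml));
// → ['BS CS_2024_1', 'BS ME_2023_1']
```

Each extracted code is then POSTed back as degCode=<value> in step 3.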

Structured Output

The curriculum scraper now produces row-based structured data similar to schedules, with each course as a separate row containing:

  • deg_code - Degree program code
  • program_label - Human-readable program name
  • year_level - 1-4
  • semester - 1-2
  • course_code - Course identifier
  • course_title - Course name
  • units - Numeric units
  • prerequisites - Prerequisites or null
  • category - Course category (M, C, etc.) or null

This enables direct use in Google Sheets with proper columns, matching the schedule scraping behavior.
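
A single parsed row using the fields listed above might look like this (the values are made up for demonstration; real rows come from the J_VOFC.do parse):

```javascript
// An illustrative curriculum row using the documented fields.
const sampleRow = {
  deg_code: 'BS CS_2024_1',
  program_label: 'BS Computer Science (2024)',
  year_level: 1,
  semester: 1,
  course_code: 'CSCI 21',
  course_title: 'Introduction to Programming I',
  units: 3,
  prerequisites: null,
  category: 'M',
};

// Because every row is flat, converting a batch to sheet rows is trivial:
const headers = Object.keys(sampleRow);
const toSheetRow = (row) => headers.map((h) => row[h] ?? '');

console.log(toSheetRow(sampleRow).length); // → 9
```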

Important Warnings

⚠️ This is an EXPERIMENTAL feature that depends on AISIS's HTML structure:

  • May break if AISIS changes the J_VOFC.do page layout
  • Not officially documented or supported by AISIS
  • Discovered through network traffic analysis (HAR file)
  • Should be treated as best-effort with monitoring

Previous Limitation (J_VOPC.do)

Earlier versions attempted to use the non-existent J_VOPC.do endpoint, which returned HTTP 404. The working alternative J_VOFC.do was discovered later through HAR analysis.

Alternative Solutions (Still Valid)

If J_VOFC.do becomes unreliable, consider:

  1. Scrape public curriculum pages: Extract from ateneo.edu/college/academics/degrees-majors
  2. Manual curriculum data: Maintain curated JSON from official PDFs
  3. Request API access: Contact AISIS administrators for official endpoint

For technical details, see docs/CURRICULUM_LIMITATION.md.

Getting Started

1. Set Up Supabase

You'll need a Supabase project with the appropriate Edge Functions deployed to receive scraped data.

  1. Create a Supabase project at https://supabase.com
  2. Deploy the Edge Functions from supabase/functions/:
    # Install Supabase CLI
    npm install -g supabase
    
    # Link to your project
    supabase link --project-ref YOUR_PROJECT_ID
    
    # Deploy the functions
    supabase functions deploy github-data-ingest
    supabase functions deploy aisis-scraper
    supabase functions deploy scrape-department
    supabase functions deploy import-schedules
  3. Set up the database schema (see supabase/functions/README.md)
  4. Generate an authentication token for the data ingest endpoint

2. Configure GitHub Secrets

In your GitHub repository, go to Settings > Secrets and variables > Actions and add the following secrets:

  • AISIS_USERNAME: Your AISIS username
  • AISIS_PASSWORD: Your AISIS password
  • SUPABASE_URL: Your Supabase project URL (e.g., https://your-project-id.supabase.co)
  • DATA_INGEST_TOKEN: The authentication token for your Supabase data ingest endpoint

3. Term Auto-Detection

The scraper now automatically detects the current academic term from AISIS without requiring manual code changes. It reads the term from the Schedule of Classes page dropdown.

To override the term (e.g., for scraping historical data or for CI/scheduled runs), you can set the AISIS_TERM environment variable:

AISIS_TERM=2025-1 npm start

Or add it to your .env file:

AISIS_TERM=2025-1

Legacy support: The APPLICABLE_PERIOD environment variable is still supported for backward compatibility, but AISIS_TERM takes precedence if both are set.

If no override is provided, the scraper will auto-detect and use the currently selected term in AISIS. Using an override skips the term auto-detection request, which can speed up startup time in CI environments.

3a. Dynamic Department Discovery

The scraper now automatically discovers departments from the AISIS Schedule of Classes page dropdown without requiring code changes.

How It Works

  • On startup, the scraper fetches the deptCode dropdown from AISIS and extracts all available department codes
  • New departments (like IE, LCS) are automatically included in scraping runs
  • If the AISIS fetch fails, the scraper falls back to a hardcoded list in src/constants.js
  • The AISIS_DEPARTMENTS environment variable can still be used to filter specific departments for testing
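
The discovery-with-fallback flow above can be sketched as follows. fetchDeptDropdownHtml and FALLBACK_DEPARTMENTS are stand-ins for the real fetch call and the hardcoded list in src/constants.js; the regex replaces the cheerio parse for a self-contained example:

```javascript
const FALLBACK_DEPARTMENTS = ['DISCS', 'MA', 'EN', 'EC']; // illustrative subset

function parseDeptCodes(html) {
  const select = html.match(/<select[^>]*name="deptCode"[^>]*>([\s\S]*?)<\/select>/i);
  if (!select) return [];
  return [...select[1].matchAll(/<option[^>]*value="([^"]+)"/gi)].map((m) => m[1]);
}

async function discoverDepartments(fetchDeptDropdownHtml) {
  try {
    const codes = parseDeptCodes(await fetchDeptDropdownHtml());
    if (codes.length > 0) return codes;
  } catch (err) {
    console.warn('Department discovery failed, using fallback:', err.message);
  }
  return FALLBACK_DEPARTMENTS; // safe fallback when the AISIS fetch fails
}

// New departments in the dropdown are picked up automatically:
discoverDepartments(async () => '<select name="deptCode"><option value="IE">IE</option></select>')
  .then((codes) => console.log(codes)); // → ['IE']
```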

Benefits

  • Future-proof: New departments are automatically discovered without code updates
  • Always current: Reflects the exact department list AISIS exposes for the current term
  • Safe fallback: Uses hardcoded list if AISIS fetch fails (network issues, page structure changes)
  • Developer-friendly: AISIS_DEPARTMENTS filter still works for local testing

Example Output

✅ Using 45 departments from AISIS dropdown (dynamic discovery)
🆕 New departments discovered: IE, LCS

4. Baseline Tracking and Regression Detection

The scraper includes automatic regression detection to alert when scraped record counts drop significantly between runs.

How It Works

  • After each scrape, the total record count and per-department counts are saved as a "baseline" in logs/baselines/baseline-{term}.json
  • On subsequent runs for the same term, the current count is compared with the previous baseline
  • If the count drops by more than a configurable threshold, a warning or error is triggered
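
The comparison logic can be sketched as a small pure function (names are illustrative; the real implementation lives in the scraper's baseline module):

```javascript
// Compare the current total against the previous baseline and flag a
// regression when the drop exceeds the configured threshold percentage.
function checkRegression(previousTotal, currentTotal, thresholdPct = 5.0) {
  const change = currentTotal - previousTotal;
  const pct = previousTotal > 0 ? (change / previousTotal) * 100 : 0;
  return {
    change,
    pct: Number(pct.toFixed(2)),
    regression: pct < -thresholdPct,
  };
}

// 4000 → 3520 is a 12% drop, which exceeds the default 5% threshold:
console.log(checkRegression(4000, 3520));
// → { change: -480, pct: -12, regression: true }
```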

Configuration

# Threshold percentage for triggering regression alert (default: 5.0)
BASELINE_DROP_THRESHOLD=5.0 npm start

# Warn-only mode: log warning but don't fail job (default: true)
# Set to false to fail the job when regression is detected
BASELINE_WARN_ONLY=true npm start

Example Output

📊 Baseline Comparison:
   Term: 2025-1
   Previous run: 2025-01-15T10:30:00.000Z
   Previous total: 4000 records
   Current total: 3520 records
   Change: -480 records (-12.00%)
   ⚠️ WARNING: Record count dropped by 480 records (12.00%)
   This exceeds the configured threshold of 5.0%

The baseline files are stored locally in logs/baselines/ and are not committed to git (already in .gitignore). In GitHub Actions, these files are ephemeral unless you configure artifact upload.

5. Performance Tuning

The scraper includes several performance optimization options:

Supabase Sync Batch Size and Concurrency

The sync phase has been optimized with batching and HTTP-level concurrency to reduce total sync time:

# Schedule sync performance
SUPABASE_CLIENT_BATCH_SIZE=2000 npm start  # Default: 2000 records per batch
SCHEDULE_SEND_CONCURRENCY=2 npm start      # Default: 2 concurrent HTTP requests

# Curriculum sync performance
CURRICULUM_SEND_GROUP_SIZE=10 npm run curriculum      # Default: 10 programs per batch
CURRICULUM_SEND_CONCURRENCY=2 npm run curriculum      # Default: 2 concurrent HTTP requests

Schedule Sync Optimization:

  • SUPABASE_CLIENT_BATCH_SIZE: Controls batch size (default: 2000 records)
    • Larger values (e.g., 3000-5000): Fewer HTTP requests, faster sync
    • Smaller values (e.g., 500-1000): More granular progress, safer for timeouts
  • SCHEDULE_SEND_CONCURRENCY: Controls parallel HTTP requests (default: 2)
    • Higher values (3-5): Faster sync but more aggressive
    • Lower values (1-2): Conservative, safer for Edge Function limits

Curriculum Sync Optimization:

  • CURRICULUM_SEND_GROUP_SIZE: Number of programs grouped per HTTP request (default: 10)
    • Reduces HTTP overhead by sending multiple programs in one call
    • Edge function internally batches DB operations (500 records per transaction)
  • CURRICULUM_SEND_CONCURRENCY: Parallel HTTP requests (default: 2)
    • Higher values (3-5): Faster sync for large curriculum sets
    • Lower values (1-2): Conservative, reduces Edge Function load

Performance Impact:

  • Schedules: For ~4000 courses, sync time reduced from ~15 minutes (43 sequential department sends) to ~3-5 minutes (2-3 batches with concurrency 2)
  • Curriculum: For ~450 programs, sync time reduced from ~15 minutes (450 sequential sends) to ~2-4 minutes (~45 grouped sends with concurrency 2)

The Edge Function further splits large batches into smaller database transactions (schedules: 100 by default via GITHUB_INGEST_DB_BATCH_SIZE, curriculum: 500) to prevent individual transaction timeouts.

Google Sheets Integration

The scraper includes optional Google Sheets integration for easy data visualization and sharing. When enabled, scraped data is automatically synced to a Google Spreadsheet alongside Supabase.

How It Works

The GoogleSheetsManager class (in src/sheets.js) uses the Google Sheets API v4 to:

  1. Clear existing data from the specified sheet tab
  2. Write headers from the first data object's keys
  3. Write data rows with automatic type conversion (objects/arrays → JSON strings)
  4. Auto-format using Google Sheets' USER_ENTERED mode for numbers, dates, etc.
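
Steps 2-3 can be sketched as a pure conversion: headers come from the first object's keys, and nested values are JSON-stringified. This mirrors the described behavior of GoogleSheetsManager; the helper name is illustrative:

```javascript
// Convert an array of flat record objects into a 2D values array
// suitable for the Sheets API (header row first).
function toSheetValues(records) {
  if (records.length === 0) return [];
  const headers = Object.keys(records[0]);
  const rows = records.map((rec) =>
    headers.map((h) => {
      const v = rec[h];
      if (v === null || v === undefined) return '';
      return typeof v === 'object' ? JSON.stringify(v) : v; // objects/arrays → JSON strings
    })
  );
  return [headers, ...rows];
}

const values = toSheetValues([
  { subject_code: 'CSCI 30', section: 'A', schedule: { day: 'MWF', time: '0800-0900' } },
]);
console.log(values[0]); // → ['subject_code', 'section', 'schedule']
console.log(values[1][2]); // → '{"day":"MWF","time":"0800-0900"}'
```

The resulting array is what gets written with the USER_ENTERED value input option so Sheets auto-formats numbers and dates.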

Setup

  1. Create a Google Cloud Service Account:

    • Go to Google Cloud Console
    • Create a new project or select an existing one
    • Enable the Google Sheets API
    • Create a Service Account and download the JSON credentials
  2. Share your Google Spreadsheet:

    • Create a new Google Sheet or use an existing one
    • Share it with the service account email (e.g., your-service@project.iam.gserviceaccount.com)
    • Grant "Editor" permissions
  3. Configure Environment Variables:

    # Base64-encode your service account JSON file
    GOOGLE_SERVICE_ACCOUNT=$(cat service-account.json | base64 -w 0)
    
    # Get your spreadsheet ID from the URL
    # https://docs.google.com/spreadsheets/d/SPREADSHEET_ID/edit
    SPREADSHEET_ID=your_spreadsheet_id_here
  4. Create Sheet Tabs:

    • For schedules: Create a tab named Schedules
    • For curriculum: Create a tab named Curriculum

Expected Sheet Names

The scraper syncs data to specific sheet tabs:

  • Schedules - Class schedule data with columns: department, term_code, subject_code, section, title, units, time, room, instructor, etc.
  • Curriculum - Curriculum data with columns: deg_code, program_label, year_level, semester, course_code, course_title, units, prerequisites, category

Data Format

Both schedule and curriculum data are synced as flat rows with proper column headers:

  • First row contains field names (auto-detected from data)
  • Subsequent rows contain course/curriculum entries
  • Complex objects are JSON-stringified for compatibility
  • Numbers and dates are auto-formatted by Google Sheets

Usage Example

// Initialize (service account credentials as Base64)
const sheets = new GoogleSheetsManager(process.env.GOOGLE_SERVICE_ACCOUNT);

// Sync schedule data
await sheets.syncData(spreadsheetId, 'Schedules', scheduleRecords);

// Sync curriculum data  
await sheets.syncData(spreadsheetId, 'Curriculum', curriculumRecords);

Troubleshooting

Error: "Unable to parse range"

  • Make sure a sheet tab with the exact name exists (case-sensitive)
  • Create tabs named Schedules and Curriculum if they don't exist

Error: "Permission denied"

  • Verify the spreadsheet is shared with your service account email
  • Grant "Editor" permissions (not just "Viewer")

Error: "API not enabled"

  • Enable the Google Sheets API in your Google Cloud project
  • Wait a few minutes for the API to become active

Performance Configuration

The scraper provides extensive configuration options for optimizing performance based on your use case (local development, CI, production).

Schedule Scraper Performance Options

Fast Mode (FAST_MODE)

Enable aggressive optimizations for faster local development:

FAST_MODE=true npm start

When enabled:

  • Skips term auto-detection if AISIS_TERM is provided
  • Skips the single test-department validation pass
  • Uses minimal batch delays (0ms by default)
  • Processes all departments immediately in concurrent batches

Use for: Local development, manual testing, rapid iteration
Avoid for: Production CI (may be too aggressive for the AISIS server)

Concurrency Control (AISIS_CONCURRENCY)

Control how many departments are scraped in parallel:

AISIS_CONCURRENCY=12 npm start  # Default: 8
  • Lower values (1-5): More polite to AISIS, safer for stability
  • Default (8): Balanced performance and stability
  • Higher values (10-20): Faster scraping, more aggressive (use with caution)

Batch Delay (AISIS_BATCH_DELAY_MS)

Control the delay between batches of departments:

AISIS_BATCH_DELAY_MS=0 npm start  # Default: 500ms
  • 0ms: No delay, maximum speed (use with FAST_MODE)
  • 500ms (default): Polite delay for production
  • 1000ms+: Very conservative, safest for AISIS stability

Department Filtering (AISIS_DEPARTMENTS)

Scrape only specific departments (useful for local testing):

AISIS_DEPARTMENTS="DISCS,MA,EN,EC" npm start
  • Accepts comma-separated list of department codes
  • Validates against the dynamically discovered department list (or fallback list if AISIS fetch failed)
  • Invalid codes are warned and ignored
  • Useful for testing changes without scraping all departments

Example local development run:

FAST_MODE=true \
AISIS_TERM=2025-1 \
AISIS_DEPARTMENTS="DISCS,MA" \
AISIS_CONCURRENCY=2 \
AISIS_BATCH_DELAY_MS=0 \
npm start

Curriculum Scraper Performance Options

Curriculum Limiting (CURRICULUM_LIMIT)

Scrape only the first N curriculum programs:

CURRICULUM_LIMIT=10 npm run curriculum
  • Useful for local development and testing
  • Takes the first N programs from AISIS dropdown
  • Default: scrape all programs (typically several hundred)

Curriculum Sampling (CURRICULUM_SAMPLE)

Scrape specific curriculum programs by degree code:

CURRICULUM_SAMPLE="BS CS_2024_1,BS ME_2023_1,BS ECE_2024_1" npm run curriculum
  • Comma-separated list of exact degCode values
  • Takes precedence over CURRICULUM_LIMIT
  • Warns if requested codes are not found in AISIS
  • Useful for testing specific programs or incremental updates

Curriculum Delay (CURRICULUM_DELAY_MS)

Control delay between curriculum requests:

CURRICULUM_DELAY_MS=0 npm run curriculum  # Default: 1000ms (balanced mode), 500ms (fast mode)
  • 0ms: No delay, maximum speed (use for local dev, higher risk)
  • 500ms: Fast mode default - good balance of speed and safety
  • 1000ms (default): Balanced mode - optimized for reliability
  • 2000ms+: Ultra-conservative (opt-in for maximum safety)

Balanced defaults: The 1000ms default provides reliable scraping while maintaining reasonable performance (~10-15 minutes for all curricula).

Curriculum Concurrency (CURRICULUM_CONCURRENCY)

Scrape multiple curriculum programs in parallel:

CURRICULUM_CONCURRENCY=3 npm run curriculum  # Default: 2 (balanced parallelism)
  • 1: Sequential scraping (ultra-safe mode, opt-in for maximum safety)
  • 2 (default): Balanced parallelism - reliable and prevents session bleed
  • 3-4: Higher parallelism - faster, increased risk of session bleed
  • 5-10: Maximum parallelism - fastest, highest risk of session bleed

Balanced defaults: The default of 2 provides parallel scraping while minimizing AISIS session bleed issues that can occur at higher concurrency levels.

Performance Improvements (v3.3+)

🚀 Curriculum scraping uses balanced defaults for reliability!

The curriculum scraper uses balanced default settings that prioritize reliability while maintaining reasonable performance:

  • Delay: 1000ms (balanced mode) or 500ms (fast mode) - prevents AISIS session bleed
  • Concurrency: 2 programs in parallel - uses _scrapeDegreeWithValidation to prevent session bleed
  • Safety maintained: All requests validated via _scrapeDegreeWithValidation, AISIS_ERROR_PAGE detection, and retry logic

Expected performance (1000ms delay, concurrency 2):

  • 459 programs ÷ 2 = 230 parallel batches
  • 230 × (1000ms delay + ~2s request) = ~690 seconds (~11.5 minutes) in delays
  • With network overhead and retries: ~10-15 minutes for 459 programs (well under 20-30 minute threshold)

For faster scraping (use FAST_MODE for 500ms delays):

  • 459 programs ÷ 2 = 230 parallel batches
  • 230 × (500ms delay + ~2s request) = ~575 seconds (~9.6 minutes) in delays
  • With network overhead and retries: ~6-10 minutes for 459 programs

Speed vs. Reliability Tradeoff:

  • Higher concurrency (>2): Faster but increased risk of AISIS session bleed
  • Lower delay (<500ms): Faster but may trigger rate limiting or session bleed
  • These defaults (2 concurrency, 1000ms delay) balance speed with reliability
  • Session bleed issues observed at concurrency=6, delay=300ms have been eliminated

These defaults have been tested and include robust validation to prevent AISIS session bleed. You can still opt for faster settings at your own risk:

CURRICULUM_DELAY_MS=300 CURRICULUM_CONCURRENCY=6 npm run curriculum

Example fast curriculum scraping:

FAST_MODE=true \
CURRICULUM_LIMIT=20 \
CURRICULUM_DELAY_MS=300 \
CURRICULUM_CONCURRENCY=4 \
npm run curriculum

Recommended Configurations

Local Development (Fast Iteration)

# .env for local development
FAST_MODE=true
AISIS_TERM=2025-1
AISIS_DEPARTMENTS=DISCS,MA
AISIS_CONCURRENCY=4
AISIS_BATCH_DELAY_MS=0
CURRICULUM_LIMIT=5
CURRICULUM_DELAY_MS=500
CURRICULUM_CONCURRENCY=2

GitHub Actions CI (Stable, Production)

# Use defaults for maximum stability (balanced performance + safety)
env:
  AISIS_TERM: '2025-1'  # Skip auto-detection for speed
  # All other settings use balanced defaults
  # AISIS_CONCURRENCY: 8 (default)
  # AISIS_BATCH_DELAY_MS: 500 (default)
  # CURRICULUM_DELAY_MS: 1000 (default - balanced mode)
  # CURRICULUM_CONCURRENCY: 2 (default - balanced parallel with validation)

Manual Full Scrape (Balance Speed & Safety)

AISIS_TERM=2025-1 \
AISIS_CONCURRENCY=10 \
AISIS_BATCH_DELAY_MS=250 \
CURRICULUM_CONCURRENCY=4 \
npm start && npm run curriculum

Performance Monitoring

The scraper logs detailed timing information for each phase:

  • Initialization: Cookie loading, session setup
  • Login & validation: AISIS authentication
  • Term detection: Auto-detect term (skipped if AISIS_TERM set)
  • Department discovery: Fetch available departments from AISIS
  • Test department: Single department validation (skipped in FAST_MODE)
  • Batch processing: Per-batch timing and progress
  • Supabase sync: Database upload timing
  • Sheets sync: Google Sheets upload timing

Example output:

⏱  Performance Summary:
   Initialization: 0.3s
   Login & validation: 2.1s
   AISIS scraping: 45.2s
   Supabase sync: 12.4s
   Sheets sync: 8.7s
   Total time: 68.7s

How It Works

  • GitHub Actions: The project has four workflows:
    1. AISIS – Class Schedule (Current + Next Term) (.github/workflows/scrape-institutional-data.yml): Runs every 6 hours to scrape class schedules for both the current term and the next term. This is the primary operational workflow that keeps schedule data fresh. Each term is synced separately to the github-data-ingest function with replace_existing: true to safely replace existing data without cross-term issues.
    2. AISIS – Class Schedule (Full Academic Year) (.github/workflows/aisis-schedule-full-year.yml): Manual trigger to scrape all three semesters (intersession, first, second) for a specified academic year
    3. AISIS – Class Schedule (All Available Terms) (.github/workflows/scrape-future-terms.yml): Runs weekly to scrape all terms in the current academic year
    4. AISIS – Degree Curricula (All Programs) (.github/workflows/scrape-curriculum.yml): Runs weekly to scrape official curriculum data
  • Scraper (src/scraper.js): This script uses node-fetch to perform direct HTTP requests and cheerio to parse the HTML, eliminating the need for a headless browser (Puppeteer). This makes the scraper significantly faster and more stable.
  • Supabase Sync (src/supabase.js): This script transforms the scraped data and syncs it to Supabase via the github-data-ingest Edge Function endpoint.
  • Main Scripts:
    • src/index.js: Entry point for scraping class schedules (current term, current+next, or multi-term modes)
    • src/scrape-full-year.js: Entry point for full academic year schedule scraping
    • src/index-curriculum.js: Entry point for scraping curriculum data
    • src/term-utils.js: Helpers for term code calculations (next term, etc.)
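
A hedged sketch of the next-term calculation in src/term-utils.js: it assumes term codes are YYYY-S with S in {0, 1, 2} (intersession, first, second semester), an assumption inferred from the full-year workflow above rather than confirmed against the actual implementation:

```javascript
// Advance a term code by one term, rolling over to the next
// academic year's intersession after the second semester.
// (Term encoding is an assumption; see lead-in.)
function nextTerm(termCode) {
  const [year, sem] = termCode.split('-').map(Number);
  return sem >= 2 ? `${year + 1}-0` : `${year}-${sem + 1}`;
}

console.log(nextTerm('2025-1')); // → '2025-2'
console.log(nextTerm('2025-2')); // → '2026-0'
```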

Running Locally (for Testing)

  1. Clone the repository:

    git clone https://github.com/CelestialBrain/aisis-scraper.git
    cd aisis-scraper
  2. Install dependencies:

    npm install
  3. Create a .env file: Copy the .env.example file to .env and fill in your credentials.

    cp .env.example .env

    Your .env file should contain:

    AISIS_USERNAME=your_username
    AISIS_PASSWORD=your_password
    SUPABASE_URL=https://your-project-id.supabase.co
    DATA_INGEST_TOKEN=your_ingest_token
    
    # Optional: Override the term for manual scraping (skips auto-detection)
    # AISIS_TERM=2025-1
    
    # Optional: Performance tuning for Supabase sync
    # SUPABASE_CLIENT_BATCH_SIZE=2000
    
  4. Run the scraper:

    For class schedules (production-ready):

    npm start

    For curriculum data (experimental - see status above):

    npm run curriculum  # May return curriculum data or empty array

    For testing the curriculum endpoint:

    node test-curriculum-endpoint.js

Architecture

This is a fast and stable scraper (v3) that:

  • Uses direct HTTP requests for reliability and speed
  • Scrapes institutional data (class schedules and experimental curriculum support)
  • Syncs directly to Supabase via Edge Functions
  • Includes robust error handling and data transformation

Batching Architecture (v3.1)

To handle large datasets (3000+ schedule records) without timeouts, the system uses two-layer batching with configurable batch sizes for optimal performance.

Layer 1: Client-Side Batching (src/supabase.js)

  • Splits large datasets into configurable chunks (default: 2000 records)
  • Sends multiple HTTP requests to the Edge Function
  • Prevents overwhelming the Edge Function with giant payloads
  • Tracks partial failures across batches
  • Configurable via SUPABASE_CLIENT_BATCH_SIZE environment variable
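
Layer 1 can be sketched as follows. The chunk size mirrors SUPABASE_CLIENT_BATCH_SIZE, and sendBatch stands in for the real HTTP call to the Edge Function; names are illustrative, not the project's actual code:

```javascript
// Split records into client-side batches of the configured size.
function chunk(records, size = 2000) {
  const batches = [];
  for (let i = 0; i < records.length; i += size) {
    batches.push(records.slice(i, i + size));
  }
  return batches;
}

// Send each batch, collecting partial failures instead of aborting.
async function syncInBatches(records, sendBatch, size = 2000) {
  const failures = [];
  for (const [i, batch] of chunk(records, size).entries()) {
    try {
      await sendBatch(batch);
    } catch (err) {
      failures.push({ batch: i, count: batch.length, error: err.message });
    }
  }
  return failures; // partial failures are reported, not fatal
}

// 3783 records at the default batch size → 2 HTTP requests:
console.log(chunk(Array(3783).fill(0)).map((b) => b.length)); // → [2000, 1783]
```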

Layer 2: Server-Side Batching (Edge Functions)

  • Further splits each request into 100-record database transactions (default)
  • Uses upsert with correct onConflict key: term_code,subject_code,section,department
  • Partial failure handling - one failed batch doesn't block others
  • Detailed logging for debugging
  • Configurable via GITHUB_INGEST_DB_BATCH_SIZE environment variable (range: 50-500)

Example: Syncing 3783 schedules (optimized)

Client sends: 2 requests × ~2000 records each
  ↓
Each request: ~20 database batches × 100 records each
  ↓
Total: ~40 database transactions of 100 records
  ↓
Result: No timeouts, faster sync (~5-8 minutes vs 14-15 minutes)

Previous architecture (v3.0): 8 requests × 500 records

Client sends: 8 requests × ~500 records each  
  ↓
Each request: 5 database batches × 100 records each
  ↓
Total: 40 database transactions of 100 records
  ↓
Result: Slower due to HTTP overhead (14-15 minutes)

This architecture ensures:

  • ✅ No 504 Gateway Timeout errors
  • ✅ Graceful handling of partial failures
  • ✅ Idempotent upserts (safe to re-run)
  • ✅ Detailed error logging

For more details, see supabase/functions/README.md.

Security Considerations

  • All credentials are stored securely in GitHub Secrets or local .env files
  • The Supabase sync endpoint should be protected with API key authentication
  • Never commit your .env file to version control

Data Validation and Testing

Parser Validation

The scraper includes comprehensive tests to ensure all course patterns are correctly parsed:

npm test         # Run basic parser tests
npm run test:all # Run all parser tests including PE subject parsing
node tests/test-real-world-patterns.js  # Test with real AISIS patterns

The tests validate:

  • Decimal course codes: ENE 13.03i, ENGL 298.66, PEPC 13.03
  • Complex section codes: WXY1, ST1A, PT-GRAD, THES/DISS1-8
  • 0-unit enrollment objects: COMP, SUB-A, SUB-B, THES/DISS, YYY, ODEF, RESID
  • Special markers: TBA (~) for special enrollment courses
  • Graduate courses: 200-300 level courses
  • Lab sections: LAB1-VW, LAB2-VW
  • PE department subjects: PEPC 10, NSTP 11/CWTS, PHYED 100.20

Subject Validation

After scraping, validate subject distribution across departments:

npm run validate:subjects  # Analyze data/courses.json
node src/validate-subjects.js data/custom-courses.json  # Custom path

This script:

  • Computes per-department course counts
  • Computes per-subject prefix breakdown (e.g., PEPC, NSTP, PHYED within PE department)
  • Identifies missing subject families in critical departments

Example output:

📊 Per-Department Summary:
PE     ( 79 courses): NSTP=79, PEPC=0, PHYED=0

🔍 Critical Department Analysis:
⚠️  PE: PEPC courses missing (count = 0)

Baseline Tracking

Baseline files are stored in logs/baselines/baseline-{term}.json and track:

  • Total record count per term
  • Per-department record counts
  • Timestamp of scrape
  • GitHub Actions metadata (if running in CI)
  • Optional: Per-department subject prefix counts (when TRACK_SUBJECT_PREFIXES=true)

Important: Baseline files are local and not committed to git. To preserve baselines across GitHub Actions runs:

  1. Upload baselines as artifacts:

    - name: Upload baselines
      uses: actions/upload-artifact@v3
      with:
        name: baselines
        path: logs/baselines/
  2. Download baselines before running scraper:

    - name: Download baselines
      uses: actions/download-artifact@v3
      with:
        name: baselines
        path: logs/baselines/
      continue-on-error: true  # Don't fail if no previous baseline

Environment Variables Summary

Variable Default Description
Authentication
AISIS_USERNAME - Required: AISIS login username
AISIS_PASSWORD - Required: AISIS login password
Data Sync
DATA_INGEST_TOKEN - Supabase ingest endpoint token
SUPABASE_URL - Supabase project URL
GOOGLE_SERVICE_ACCOUNT - Base64-encoded service account JSON
SPREADSHEET_ID - Google Sheets spreadsheet ID
SUPABASE_CLIENT_BATCH_SIZE 2000 Records per HTTP request to Supabase
CURRICULUM_SEND_GROUP_SIZE 10 Programs grouped per HTTP request (1-50)
CURRICULUM_SEND_CONCURRENCY 2 Concurrent curriculum group sends (1-5)
Term Configuration
AISIS_TERM Auto-detect Override term code (e.g., 2025-1)
APPLICABLE_PERIOD Auto-detect Legacy term override (use AISIS_TERM instead)
AISIS_SCRAPE_MODE current_next Scrape mode: current, current_next, future, all, or year. See MULTI_TERM_SCRAPING.md
Schedule Scraper Performance
FAST_MODE false Enable fast mode (skip validation, minimal delays)
AISIS_CONCURRENCY 8 Departments to scrape in parallel (1-20)
AISIS_BATCH_DELAY_MS 500 Delay between department batches (0-5000ms)
AISIS_DEPARTMENTS All Comma-separated list of departments to scrape
Curriculum Scraper Performance
CURRICULUM_LIMIT All Limit to first N curriculum programs
CURRICULUM_SAMPLE All Comma-separated list of specific degree codes
CURRICULUM_DELAY_MS 1000 Delay between curriculum requests (0-5000ms) - Balanced default
CURRICULUM_CONCURRENCY 2 Programs to scrape in parallel (1-10) - Balanced default
Regression Detection
BASELINE_DROP_THRESHOLD 5.0 Overall regression alert threshold (%)
BASELINE_DEPT_DROP_THRESHOLD 0.5 Per-department regression threshold (0.0-1.0 = 0%-100% drop)
BASELINE_WARN_ONLY true Warn only (don't fail job) on regression
REQUIRE_BASELINES true Fail job if baselines artifact is missing (prevents data loss). See docs/ingestion.md
TRACK_SUBJECT_PREFIXES false Track per-department subject prefix counts in baselines for regression detection
Department Sanity Checks
SCRAPER_MIN_MA_MATH 50 Minimum MATH courses required for MA (Mathematics) department
SCRAPER_MIN_PE_COURSES 20 Minimum total courses required for PE department
SCRAPER_MIN_NSTP_COURSES 10 Minimum NSTP courses required for NSTP departments
Debugging
DEBUG_SCRAPER false Enable detailed debug logging including subject prefix breakdowns

Data Loss Protection

The scraper includes comprehensive safeguards to prevent data loss from AISIS misrouting or HTML quirks. These protections were implemented after a critical incident where the MA (Mathematics) department returned only 13 Korean-language courses instead of 300+ MATH courses, and the scraper used replace_existing=true to wipe out all correct data.

Department-Level Sanity Checks

The scraper performs automatic sanity checks on critical departments during scraping:

MA (Mathematics) Department:

  • Counts courses with MATH subject prefix
  • Requires minimum of 50 MATH courses (configurable via SCRAPER_MIN_MA_MATH)
  • If count is 0 or below threshold, scrape fails and department is marked as failed
  • Raw HTML is saved to logs/ for debugging

PE (Physical Education) Department:

  • Requires presence of PEPC and/or PHYED courses
  • Enforces minimum total course count (default: 20, configurable via SCRAPER_MIN_PE_COURSES)
  • Detects when required subject prefixes are missing

NSTP Departments:

  • Requires minimum NSTP-prefixed courses (default: 10, configurable via SCRAPER_MIN_NSTP_COURSES)
  • Applies to both NSTP (ADAST) and NSTP (OSCI) departments

When a sanity check fails:

  • The department scrape throws an error
  • The department is marked as failed in scrape summary
  • Department's courses are excluded from Supabase sync
  • Raw HTML response is saved to logs/raw-sanity-check-failed-{term}-{dept}.html
  • Clear error messages are logged for debugging
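
The MA check above can be sketched as a prefix count plus a threshold guard. The threshold mirrors the documented SCRAPER_MIN_MA_MATH behavior; the functions themselves are illustrative, not the project's actual code:

```javascript
// Count courses whose subject code starts with a given prefix
// (e.g. 'MATH 21' matches prefix 'MATH').
function countByPrefix(courses, prefix) {
  return courses.filter((c) => (c.subject_code || '').startsWith(prefix + ' ')).length;
}

// Throw if the MA department returned too few MATH courses,
// marking the department as failed instead of syncing bad data.
function assertMaSanity(courses, minMath = 50) {
  const mathCount = countByPrefix(courses, 'MATH');
  if (mathCount < minMath) {
    throw new Error(
      `MA sanity check failed: ${mathCount} MATH courses found, minimum is ${minMath}`
    );
  }
  return mathCount;
}
```

A department that returns, say, 13 misrouted courses would fail this check and be excluded from the sync.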

Per-Department Baseline Tracking

The baseline system now tracks per-department statistics in addition to overall counts:

  • Stores baseline file: logs/baselines/baseline-{term}-departments.json
  • Tracks for each department:
    • row_count: Number of courses
    • prefix_breakdown: Count of courses by subject prefix (e.g., MATH=305, PEPC=79)
  • Detects regressions on a per-department basis before syncing to Supabase
  • Configurable drop threshold (default: 50% via BASELINE_DEPT_DROP_THRESHOLD)

Critical departments (MA, PE, NSTP) that fail regression checks will block replace_existing=true behavior, preventing destructive syncs.

Supabase Sync Hardening

Before syncing schedule data with replace_existing=true:

  1. Pre-sync health check validates all department data
  2. Compares current per-department counts against baselines
  3. Detects critical regressions (e.g., MA dropping from 305 to 13 courses)
  4. If critical departments fail health check:
    • Sync is aborted to prevent data loss
    • Clear error message explains which departments failed
    • replace_existing=true is never sent to Supabase
    • Existing good data in database is preserved
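
The health check's regression comparison can be sketched as follows. The threshold mirrors BASELINE_DEPT_DROP_THRESHOLD (0.5 = 50% drop); the function name is illustrative:

```javascript
// Compare per-department counts against the baseline and return the
// departments whose drop exceeds the threshold fraction.
function failingDepartments(baseline, current, dropThreshold = 0.5) {
  return Object.entries(baseline)
    .filter(([dept, prevCount]) => {
      const nowCount = current[dept] ?? 0;
      return prevCount > 0 && (prevCount - nowCount) / prevCount > dropThreshold;
    })
    .map(([dept]) => dept);
}

// MA dropping from 305 to 13 is a ~96% drop, so the sync would be aborted:
console.log(failingDepartments({ MA: 305, PE: 79 }, { MA: 13, PE: 80 })); // → ['MA']
```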

Raw HTML Snapshotting

When sanity checks or health checks fail, the scraper automatically:

  • Creates logs/ directory if it doesn't exist
  • Saves raw HTML response to timestamped file
  • Logs file path for manual inspection and debugging
  • Filename format: logs/raw-{reason}-{term}-{dept}-{timestamp}.html

This allows maintainers to inspect exactly what AISIS returned and diagnose the root cause.

Configuration

All sanity check thresholds are configurable via environment variables:

# MA (Mathematics) department
SCRAPER_MIN_MA_MATH=50          # Minimum MATH courses

# PE (Physical Education) department
SCRAPER_MIN_PE_COURSES=20       # Minimum total courses

# NSTP departments
SCRAPER_MIN_NSTP_COURSES=10     # Minimum NSTP courses

# Per-department regression threshold
BASELINE_DEPT_DROP_THRESHOLD=0.5  # 50% drop triggers regression

# Enable verbose logging
DEBUG_SCRAPER=true

Troubleshooting

If you see sanity check failures:

  1. Check logs for error messages indicating which department failed and why
  2. Inspect raw HTML saved to logs/ directory
  3. Verify AISIS is returning correct data via web browser
  4. Adjust thresholds if legitimate changes occurred (new semester, course restructuring)
  5. Re-run scraper once AISIS issue is resolved

For more details on the baseline system and regression detection, see existing documentation on BASELINE_DROP_THRESHOLD and BASELINE_WARN_ONLY.

License

This project is licensed under the MIT License.
