This project contains a Node.js-based web scraper that automatically logs into AISIS, scrapes institutional data (class schedules and curriculum), and syncs the data to Supabase. The scraper is designed to run on a schedule using GitHub Actions.
- Automated Scraping: Runs on a scheduled basis via GitHub Actions.
- Dynamic Department Discovery: Automatically discovers and scrapes all departments from the AISIS dropdown (IE, LCS, and any future departments are included without code changes).
- Multi-Term Support: Can scrape current term, current + next term (new default), future terms, all available terms, or all terms in the current academic year in one run. See MULTI_TERM_SCRAPING.md.
- Institutional Data Focus: Scrapes class schedules and official curriculum data.
- Supabase Integration: Automatically syncs data to Supabase via Edge Functions.
- Batched Sync Architecture: Two-layer batching prevents 504 timeouts when syncing thousands of records.
- Secure Credential Management: Uses GitHub Secrets for secure storage of credentials.
- Fast Mode: Uses direct HTTP requests (node-fetch + Cheerio) instead of Puppeteer for speed, stability, and low memory usage.
- Production-Grade: Built with error handling, robust data transformation, and partial failure recovery.
- 🛡️ Data Loss Protection: Comprehensive sanity checks and per-department baseline tracking prevent destructive syncs when AISIS misbehaves (e.g., returns wrong courses). See Data Loss Protection below.
Note: As of the latest update, the default `AISIS_SCRAPE_MODE` changed from `current` to `current_next`. This means the scraper now fetches both the current term and the next term by default. To restore the previous single-term behavior, set `AISIS_SCRAPE_MODE=current` in your environment.
- Schedule of Classes: All available class schedules for all departments (runs every 6 hours). Supports multi-term scraping. ✅ Working
- Official Curriculum:
⚠️ EXPERIMENTAL - Curriculum scraping is now supported via the `J_VOFC.do` endpoint. See Curriculum Scraping Status below for details.
Status:
The curriculum scraper uses the `J_VOFC.do` endpoint discovered through HAR file analysis and now includes structured parsing:
- GET `J_VOFC.do` - Retrieves a form with a dropdown containing all curriculum versions
- Parse `<select name="degCode">` - Extracts curriculum version identifiers (e.g., `BS CS_2024_1`)
- POST `J_VOFC.do` with `degCode=<value>` - Fetches curriculum HTML for each version
- Parse HTML to structured rows - Extracts year/semester headers and course data into structured objects (NEW)
- Sync to Supabase and Google Sheets - Saves flat course rows with proper columns, matching the schedule data (IMPROVED)
The curriculum scraper now produces row-based structured data similar to schedules, with each course as a separate row containing:
- `deg_code` - Degree program code
- `program_label` - Human-readable program name
- `year_level` - 1-4
- `semester` - 1-2
- `course_code` - Course identifier
- `course_title` - Course name
- `units` - Numeric units
- `prerequisites` - Prerequisites or null
- `category` - Course category (M, C, etc.) or null
This enables direct use in Google Sheets with proper columns, matching the schedule scraping behavior.
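As a sketch, one such flat row might look like the object below (the field names come from the list above; the values and the grouping helper are invented for illustration):

```javascript
// Illustrative flat curriculum row (values are made up for the example).
const exampleRow = {
  deg_code: 'BS CS_2024_1',
  program_label: 'BS Computer Science (2024)', // hypothetical label
  year_level: 1,
  semester: 1,
  course_code: 'CSCI 21', // hypothetical course
  course_title: 'Introduction to Computing',
  units: 3,
  prerequisites: null,
  category: 'M',
};

// Sketch: group flat rows back into year/semester buckets for display.
function groupByYearSemester(rows) {
  const grouped = {};
  for (const row of rows) {
    const key = `Y${row.year_level}-S${row.semester}`;
    (grouped[key] ||= []).push(row.course_code);
  }
  return grouped;
}

console.log(groupByYearSemester([exampleRow]));
// { 'Y1-S1': [ 'CSCI 21' ] }
```

Because every course is a standalone row, the same data can be written directly as spreadsheet rows or re-aggregated as needed.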
- May break if AISIS changes the `J_VOFC.do` page layout
- Not officially documented or supported by AISIS
- Discovered through network traffic analysis (HAR file)
- Should be treated as best-effort with monitoring
Earlier versions attempted to use the non-existent `J_VOPC.do` endpoint, which returned HTTP 404. The working alternative `J_VOFC.do` was discovered later through HAR analysis.

If `J_VOFC.do` becomes unreliable, consider:
- Scrape public curriculum pages: Extract from ateneo.edu/college/academics/degrees-majors
- Manual curriculum data: Maintain curated JSON from official PDFs
- Request API access: Contact AISIS administrators for an official endpoint
For technical details, see docs/CURRICULUM_LIMITATION.md.
You'll need a Supabase project with the appropriate Edge Functions deployed to receive scraped data.
- Create a Supabase project at https://supabase.com
- Deploy the Edge Functions from `supabase/functions/`:
  ```bash
  # Install Supabase CLI
  npm install -g supabase

  # Link to your project
  supabase link --project-ref YOUR_PROJECT_ID

  # Deploy the functions
  supabase functions deploy github-data-ingest
  supabase functions deploy aisis-scraper
  supabase functions deploy scrape-department
  supabase functions deploy import-schedules
  ```
- Set up the database schema (see `supabase/functions/README.md`)
- Generate an authentication token for the data ingest endpoint
In your GitHub repository, go to Settings > Secrets and variables > Actions and add the following secrets:
- `AISIS_USERNAME`: Your AISIS username
- `AISIS_PASSWORD`: Your AISIS password
- `SUPABASE_URL`: Your Supabase project URL (e.g., `https://your-project-id.supabase.co`)
- `DATA_INGEST_TOKEN`: The authentication token for your Supabase data ingest endpoint
The scraper now automatically detects the current academic term from AISIS without requiring manual code changes. It reads the term from the Schedule of Classes page dropdown.
To override the term (e.g., for scraping historical data or for CI/scheduled runs), set the `AISIS_TERM` environment variable:
```bash
AISIS_TERM=2025-1 npm start
```

Or add it to your `.env` file:

```bash
AISIS_TERM=2025-1
```
Legacy support: The `APPLICABLE_PERIOD` environment variable is still supported for backward compatibility, but `AISIS_TERM` takes precedence if both are set.
If no override is provided, the scraper will auto-detect and use the currently selected term in AISIS. Using an override skips the term auto-detection request, which can speed up startup time in CI environments.
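The precedence rules above can be sketched as a small resolver (a hypothetical helper, not the scraper's actual startup code):

```javascript
// Sketch of term-override precedence: AISIS_TERM wins over the legacy
// APPLICABLE_PERIOD; with neither set, the caller must auto-detect.
function resolveTerm(env) {
  if (env.AISIS_TERM) return { term: env.AISIS_TERM, autoDetect: false };
  if (env.APPLICABLE_PERIOD) return { term: env.APPLICABLE_PERIOD, autoDetect: false };
  return { term: null, autoDetect: true };
}

console.log(resolveTerm({ AISIS_TERM: '2025-1', APPLICABLE_PERIOD: '2024-2' }));
// { term: '2025-1', autoDetect: false }
```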
The scraper now automatically discovers departments from the AISIS Schedule of Classes page dropdown without requiring code changes.
- On startup, the scraper fetches the `deptCode` dropdown from AISIS and extracts all available department codes
- New departments (like IE, LCS) are automatically included in scraping runs
- If the AISIS fetch fails, the scraper falls back to a hardcoded list in `src/constants.js`
- The `AISIS_DEPARTMENTS` environment variable can still be used to filter specific departments for testing
- Future-proof: New departments are automatically discovered without code updates
- Always current: Reflects the exact department list AISIS exposes for the current term
- Safe fallback: Uses hardcoded list if AISIS fetch fails (network issues, page structure changes)
- Developer-friendly: the `AISIS_DEPARTMENTS` filter still works for local testing
✅ Using 45 departments from AISIS dropdown (dynamic discovery)
🆕 New departments discovered: IE, LCS
The scraper includes automatic regression detection to alert when scraped record counts drop significantly between runs.
- After each scrape, the total record count and per-department counts are saved as a "baseline" in `logs/baselines/baseline-{term}.json`
- On subsequent runs for the same term, the current count is compared with the previous baseline
- If the count drops by more than a configurable threshold, a warning or error is triggered
```bash
# Threshold percentage for triggering regression alert (default: 5.0)
BASELINE_DROP_THRESHOLD=5.0 npm start

# Warn-only mode: log warning but don't fail job (default: true)
# Set to false to fail the job when regression is detected
BASELINE_WARN_ONLY=true npm start
```

Example output:

```
📊 Baseline Comparison:
   Term: 2025-1
   Previous run: 2025-01-15T10:30:00.000Z
   Previous total: 4000 records
   Current total: 3520 records
   Change: -480 records (-12.00%)

⚠️ WARNING: Record count dropped by 480 records (12.00%)
   This exceeds the configured threshold of 5.0%
```
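The comparison itself is simple percentage arithmetic; a minimal sketch (the function name is illustrative, not from the codebase):

```javascript
// Compare a previous baseline total against the current total and flag
// drops beyond a percentage threshold (default 5%, mirroring
// BASELINE_DROP_THRESHOLD).
function checkRegression(previousTotal, currentTotal, thresholdPct = 5.0) {
  const change = currentTotal - previousTotal;
  const dropPct = previousTotal > 0 ? (-change / previousTotal) * 100 : 0;
  return {
    change,
    dropPct: Number(dropPct.toFixed(2)),
    exceedsThreshold: dropPct > thresholdPct,
  };
}

// Reproduces the example above: 4000 -> 3520 is a 12% drop.
console.log(checkRegression(4000, 3520));
// { change: -480, dropPct: 12, exceedsThreshold: true }
```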
The baseline files are stored locally in `logs/baselines/` and are not committed to git (already in `.gitignore`). In GitHub Actions, these files are ephemeral unless you configure artifact upload.
The scraper includes several performance optimization options:
The sync phase has been optimized with batching and HTTP-level concurrency to reduce total sync time:
```bash
# Schedule sync performance
SUPABASE_CLIENT_BATCH_SIZE=2000 npm start   # Default: 2000 records per batch
SCHEDULE_SEND_CONCURRENCY=2 npm start       # Default: 2 concurrent HTTP requests

# Curriculum sync performance
CURRICULUM_SEND_GROUP_SIZE=10 npm run curriculum    # Default: 10 programs per batch
CURRICULUM_SEND_CONCURRENCY=2 npm run curriculum    # Default: 2 concurrent HTTP requests
```

Schedule Sync Optimization:
- `SUPABASE_CLIENT_BATCH_SIZE`: Controls batch size (default: 2000 records)
  - Larger values (e.g., 3000-5000): Fewer HTTP requests, faster sync
  - Smaller values (e.g., 500-1000): More granular progress, safer for timeouts
- `SCHEDULE_SEND_CONCURRENCY`: Controls parallel HTTP requests (default: 2)
  - Higher values (3-5): Faster sync but more aggressive
  - Lower values (1-2): Conservative, safer for Edge Function limits
Curriculum Sync Optimization:
- `CURRICULUM_SEND_GROUP_SIZE`: Number of programs grouped per HTTP request (default: 10)
  - Reduces HTTP overhead by sending multiple programs in one call
  - The Edge Function internally batches DB operations (500 records per transaction)
- `CURRICULUM_SEND_CONCURRENCY`: Parallel HTTP requests (default: 2)
  - Higher values (3-5): Faster sync for large curriculum sets
  - Lower values (1-2): Conservative, reduces Edge Function load
Performance Impact:
- Schedules: For ~4000 courses, sync time reduced from ~15 minutes (43 sequential department sends) to ~3-5 minutes (2-3 batches with concurrency 2)
- Curriculum: For ~450 programs, sync time reduced from ~15 minutes (450 sequential sends) to ~2-4 minutes (~45 grouped sends with concurrency 2)
The Edge Function further splits large batches into smaller database transactions (schedules: 100 by default via `GITHUB_INGEST_DB_BATCH_SIZE`, curriculum: 500) to prevent individual transaction timeouts.
The scraper includes optional Google Sheets integration for easy data visualization and sharing. When enabled, scraped data is automatically synced to a Google Spreadsheet alongside Supabase.
The `GoogleSheetsManager` class (in `src/sheets.js`) uses the Google Sheets API v4 to:
- Clear existing data from the specified sheet tab
- Write headers from the first data object's keys
- Write data rows with automatic type conversion (objects/arrays → JSON strings)
- Auto-format using Google Sheets' `USER_ENTERED` mode for numbers, dates, etc.
1. Create a Google Cloud Service Account:
   - Go to Google Cloud Console
   - Create a new project or select an existing one
   - Enable the Google Sheets API
   - Create a Service Account and download the JSON credentials

2. Share your Google Spreadsheet:
   - Create a new Google Sheet or use an existing one
   - Share it with the service account email (e.g., `your-service@project.iam.gserviceaccount.com`)
   - Grant "Editor" permissions

3. Configure Environment Variables:
   ```bash
   # Base64-encode your service account JSON file
   GOOGLE_SERVICE_ACCOUNT=$(cat service-account.json | base64 -w 0)

   # Get your spreadsheet ID from the URL
   # https://docs.google.com/spreadsheets/d/SPREADSHEET_ID/edit
   SPREADSHEET_ID=your_spreadsheet_id_here
   ```

4. Create Sheet Tabs:
   - For schedules: Create a tab named `Schedules`
   - For curriculum: Create a tab named `Curriculum`
The scraper syncs data to specific sheet tabs:
- `Schedules` - Class schedule data with columns: `department`, `term_code`, `subject_code`, `section`, `title`, `units`, `time`, `room`, `instructor`, etc.
- `Curriculum` - Curriculum data with columns: `deg_code`, `program_label`, `year_level`, `semester`, `course_code`, `course_title`, `units`, `prerequisites`, `category`
Both schedule and curriculum data are synced as flat rows with proper column headers:
- First row contains field names (auto-detected from data)
- Subsequent rows contain course/curriculum entries
- Complex objects are JSON-stringified for compatibility
- Numbers and dates are auto-formatted by Google Sheets
```javascript
// Initialize (service account credentials as Base64)
const sheets = new GoogleSheetsManager(process.env.GOOGLE_SERVICE_ACCOUNT);

// Sync schedule data
await sheets.syncData(spreadsheetId, 'Schedules', scheduleRecords);

// Sync curriculum data
await sheets.syncData(spreadsheetId, 'Curriculum', curriculumRecords);
```

Error: "Unable to parse range"
- Make sure a sheet tab with the exact name exists (case-sensitive)
- Create tabs named `Schedules` and `Curriculum` if they don't exist
Error: "Permission denied"
- Verify the spreadsheet is shared with your service account email
- Grant "Editor" permissions (not just "Viewer")
Error: "API not enabled"
- Enable the Google Sheets API in your Google Cloud project
- Wait a few minutes for the API to become active
The scraper provides extensive configuration options for optimizing performance based on your use case (local development, CI, production).
Enable aggressive optimizations for faster local development:
```bash
FAST_MODE=true npm start
```

When enabled:
- Skips term auto-detection if `AISIS_TERM` is provided
- Skips the single test-department validation pass
- Uses minimal batch delays (0ms by default)
- Processes all departments immediately in concurrent batches

Use for: Local development, manual testing, rapid iteration
Avoid for: Production CI (may be too aggressive for the AISIS server)
Control how many departments are scraped in parallel:
```bash
AISIS_CONCURRENCY=12 npm start  # Default: 8
```
- Lower values (1-5): More polite to AISIS, safer for stability
- Default (8): Balanced performance and stability
- Higher values (10-20): Faster scraping, more aggressive (use with caution)
Control the delay between batches of departments:
```bash
AISIS_BATCH_DELAY_MS=0 npm start  # Default: 500ms
```
- 0ms: No delay, maximum speed (use with `FAST_MODE`)
- 500ms (default): Polite delay for production
- 1000ms+: Very conservative, safest for AISIS stability
Scrape only specific departments (useful for local testing):
```bash
AISIS_DEPARTMENTS="DISCS,MA,EN,EC" npm start
```
- Accepts a comma-separated list of department codes
- Validates against the dynamically discovered department list (or fallback list if AISIS fetch failed)
- Invalid codes are warned and ignored
- Useful for testing changes without scraping all departments
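The filter-and-validate behavior can be sketched as follows (a hypothetical helper; the real scraper's warning wording may differ):

```javascript
// Filter a comma-separated AISIS_DEPARTMENTS value against the list of
// discovered departments, warning about (and dropping) unknown codes.
function filterDepartments(envValue, discovered) {
  if (!envValue) return discovered; // no filter: scrape everything
  const requested = envValue.split(',').map((code) => code.trim());
  const invalid = requested.filter((code) => !discovered.includes(code));
  if (invalid.length > 0) {
    console.warn(`Ignoring unknown department codes: ${invalid.join(', ')}`);
  }
  return requested.filter((code) => discovered.includes(code));
}

console.log(filterDepartments('DISCS,MA,BOGUS', ['DISCS', 'MA', 'EN', 'EC']));
// warns about BOGUS, then prints: [ 'DISCS', 'MA' ]
```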
Example local development run:
```bash
FAST_MODE=true \
AISIS_TERM=2025-1 \
AISIS_DEPARTMENTS="DISCS,MA" \
AISIS_CONCURRENCY=2 \
AISIS_BATCH_DELAY_MS=0 \
npm start
```

Scrape only the first N curriculum programs:
```bash
CURRICULUM_LIMIT=10 npm run curriculum
```
- Useful for local development and testing
- Takes the first N programs from AISIS dropdown
- Default: scrape all programs (typically 50-100+)
Scrape specific curriculum programs by degree code:
```bash
CURRICULUM_SAMPLE="BS CS_2024_1,BS ME_2023_1,BS ECE_2024_1" npm run curriculum
```
- Comma-separated list of exact `degCode` values
- Takes precedence over `CURRICULUM_LIMIT`
- Warns if requested codes are not found in AISIS
- Useful for testing specific programs or incremental updates
Control delay between curriculum requests:
```bash
CURRICULUM_DELAY_MS=0 npm run curriculum  # Default: 1000ms (balanced mode), 500ms (fast mode)
```
- 0ms: No delay, maximum speed (use for local dev, higher risk)
- 500ms: Fast mode default - good balance of speed and safety
- 1000ms (default): Balanced mode - optimized for reliability
- 2000ms+: Ultra-conservative (opt-in for maximum safety)
Balanced defaults: The 1000ms default provides reliable scraping while maintaining reasonable performance (~10-15 minutes for all curricula).
Scrape multiple curriculum programs in parallel:
```bash
CURRICULUM_CONCURRENCY=3 npm run curriculum  # Default: 2 (balanced parallelism)
```
- 1: Sequential scraping (ultra-safe mode, opt-in for maximum safety)
- 2 (default): Balanced parallelism - reliable and prevents session bleed
- 3-4: Higher parallelism - faster, increased risk of session bleed
- 5-10: Maximum parallelism - fastest, highest risk of session bleed
Balanced defaults: The default of 2 provides parallel scraping while minimizing AISIS session bleed issues that can occur at higher concurrency levels.
🚀 Curriculum scraping uses balanced defaults for reliability!
The curriculum scraper uses balanced default settings that prioritize reliability while maintaining reasonable performance:
- Delay: 1000ms (balanced mode) or 500ms (fast mode) - prevents AISIS session bleed
- Concurrency: 2 programs in parallel - uses `_scrapeDegreeWithValidation` to prevent session bleed
- Safety maintained: All requests validated via `_scrapeDegreeWithValidation`, AISIS_ERROR_PAGE detection, and retry logic
Expected performance (1000ms delay, concurrency 2):
- 459 programs ÷ 2 = 230 parallel batches
- 230 × (1000ms delay + ~2s request) = ~690 seconds (~11.5 minutes) in delays
- With network overhead and retries: ~10-15 minutes for 459 programs (well under 20-30 minute threshold)
For faster scraping (use FAST_MODE for 500ms delays):
- 459 programs ÷ 2 = 230 parallel batches
- 230 × (500ms delay + ~2s request) = ~575 seconds (~9.6 minutes) in delays
- With network overhead and retries: ~6-10 minutes for 459 programs
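The arithmetic above can be captured in a back-of-the-envelope estimator (the ~2s per-request figure is an assumption carried over from the text):

```javascript
// Estimate total delay-plus-request time in seconds for scraping N
// programs with a given concurrency and inter-request delay.
function estimateSeconds(programs, concurrency, delayMs, requestMs = 2000) {
  const batches = Math.ceil(programs / concurrency); // parallel batches
  return (batches * (delayMs + requestMs)) / 1000;
}

console.log(estimateSeconds(459, 2, 1000)); // 690  (~11.5 minutes)
console.log(estimateSeconds(459, 2, 500));  // 575  (~9.6 minutes)
```

Network overhead and retries push the real wall-clock numbers toward the ranges quoted above.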
Speed vs. Reliability Tradeoff:
- Higher concurrency (>2): Faster but increased risk of AISIS session bleed
- Lower delay (<500ms): Faster but may trigger rate limiting or session bleed
- These defaults (2 concurrency, 1000ms delay) balance speed with reliability
- Session bleed issues observed at concurrency=6, delay=300ms have been eliminated
These defaults have been tested and include robust validation to prevent AISIS session bleed. You can still opt for faster settings at your own risk:
```bash
CURRICULUM_DELAY_MS=300 CURRICULUM_CONCURRENCY=6 npm run curriculum
```

Example fast curriculum scraping:
```bash
FAST_MODE=true \
CURRICULUM_LIMIT=20 \
CURRICULUM_DELAY_MS=300 \
CURRICULUM_CONCURRENCY=4 \
npm run curriculum
```

```bash
# .env for local development
FAST_MODE=true
AISIS_TERM=2025-1
AISIS_DEPARTMENTS=DISCS,MA
AISIS_CONCURRENCY=4
AISIS_BATCH_DELAY_MS=0
CURRICULUM_LIMIT=5
CURRICULUM_DELAY_MS=500
CURRICULUM_CONCURRENCY=2
```

```yaml
# Use defaults for maximum stability (balanced performance + safety)
env:
  AISIS_TERM: '2025-1' # Skip auto-detection for speed
  # All other settings use balanced defaults
  # AISIS_CONCURRENCY: 8 (default)
  # AISIS_BATCH_DELAY_MS: 500 (default)
  # CURRICULUM_DELAY_MS: 1000 (default - balanced mode)
  # CURRICULUM_CONCURRENCY: 2 (default - balanced parallel with validation)
```

```bash
AISIS_TERM=2025-1 \
AISIS_CONCURRENCY=10 \
AISIS_BATCH_DELAY_MS=250 \
CURRICULUM_CONCURRENCY=4 \
npm start && npm run curriculum
```

The scraper logs detailed timing information for each phase:
- Initialization: Cookie loading, session setup
- Login & validation: AISIS authentication
- Term detection: Auto-detect term (skipped if `AISIS_TERM` is set)
- Department discovery: Fetch available departments from AISIS
- Test department: Single department validation (skipped in `FAST_MODE`)
- Batch processing: Per-batch timing and progress
- Supabase sync: Database upload timing
- Sheets sync: Google Sheets upload timing
Example output:

```
⏱ Performance Summary:
   Initialization: 0.3s
   Login & validation: 2.1s
   AISIS scraping: 45.2s
   Supabase sync: 12.4s
   Sheets sync: 8.7s
   Total time: 68.7s
```
- GitHub Actions: The project has four workflows:
  - AISIS – Class Schedule (Current + Next Term) (`.github/workflows/scrape-institutional-data.yml`): Runs every 6 hours to scrape class schedules for both the current term and the next term. This is the primary operational workflow that keeps schedule data fresh. Each term is synced separately to the `github-data-ingest` function with `replace_existing: true` to safely replace existing data without cross-term issues.
  - AISIS – Class Schedule (Full Academic Year) (`.github/workflows/aisis-schedule-full-year.yml`): Manual trigger to scrape all three semesters (intersession, first, second) for a specified academic year
  - AISIS – Class Schedule (All Available Terms) (`.github/workflows/scrape-future-terms.yml`): Runs weekly to scrape all terms in the current academic year
  - AISIS – Degree Curricula (All Programs) (`.github/workflows/scrape-curriculum.yml`): Runs weekly to scrape official curriculum data
- Scraper (`src/scraper.js`): This script uses `node-fetch` to perform direct HTTP requests and `cheerio` to parse the HTML, eliminating the need for a headless browser (Puppeteer). This makes the scraper significantly faster and more stable.
- Supabase Sync (`src/supabase.js`): This script transforms the scraped data and syncs it to Supabase via the `github-data-ingest` Edge Function endpoint.
- Main Scripts:
  - `src/index.js`: Entry point for scraping class schedules (current term, current+next, or multi-term modes)
  - `src/scrape-full-year.js`: Entry point for full academic year schedule scraping
  - `src/index-curriculum.js`: Entry point for scraping curriculum data
  - `src/term-utils.js`: Helpers for term code calculations (next term, etc.)
1. Clone the repository:
   ```bash
   git clone https://github.com/CelestialBrain/aisis-scraper.git
   cd aisis-scraper
   ```

2. Install dependencies:
   ```bash
   npm install
   ```

3. Create a `.env` file: Copy the `.env.example` file to `.env` and fill in your credentials.
   ```bash
   cp .env.example .env
   ```

   Your `.env` file should contain:
   ```bash
   AISIS_USERNAME=your_username
   AISIS_PASSWORD=your_password
   SUPABASE_URL=https://your-project-id.supabase.co
   DATA_INGEST_TOKEN=your_ingest_token

   # Optional: Override the term for manual scraping (skips auto-detection)
   # AISIS_TERM=2025-1

   # Optional: Performance tuning for Supabase sync
   # SUPABASE_CLIENT_BATCH_SIZE=2000
   ```

4. Run the scraper:

   For class schedules (production-ready):
   ```bash
   npm start
   ```

   For curriculum data (experimental - see status above):
   ```bash
   npm run curriculum  # May return curriculum data or empty array
   ```

   For testing the curriculum endpoint:
   ```bash
   node test-curriculum-endpoint.js
   ```
This is a fast and stable scraper (v3) that:
- Uses direct HTTP requests for reliability and speed
- Scrapes institutional data (class schedules and experimental curriculum support)
- Syncs directly to Supabase via Edge Functions
- Includes robust error handling and data transformation
To handle large datasets (3000+ schedule records) without timeouts, the system uses two-layer batching with configurable batch sizes for optimal performance.
- Splits large datasets into configurable chunks (default: 2000 records)
- Sends multiple HTTP requests to the Edge Function
- Prevents overwhelming the Edge Function with giant payloads
- Tracks partial failures across batches
- Configurable via the `SUPABASE_CLIENT_BATCH_SIZE` environment variable
- Further splits each request into 100-record database transactions (default)
- Uses `upsert` with the correct `onConflict` key: `term_code,subject_code,section,department`
- Partial failure handling - one failed batch doesn't block others
- Detailed logging for debugging
- Configurable via the `GITHUB_INGEST_DB_BATCH_SIZE` environment variable (range: 50-500)
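The two-layer split can be sketched with plain chunking arithmetic (a minimal sketch using the default batch sizes described above, not the actual sync code):

```javascript
// Split an array into fixed-size chunks.
function chunk(records, size) {
  const out = [];
  for (let i = 0; i < records.length; i += size) out.push(records.slice(i, i + size));
  return out;
}

// Plan a sync: layer 1 chunks records into HTTP requests, layer 2 counts
// the database transactions each request is split into server-side.
function planSync(records, clientBatchSize = 2000, dbBatchSize = 100) {
  const httpBatches = chunk(records, clientBatchSize);
  const dbTransactions = httpBatches.reduce(
    (total, batch) => total + Math.ceil(batch.length / dbBatchSize), 0);
  return { httpRequests: httpBatches.length, dbTransactions };
}

// 3783 records -> 2 HTTP requests (2000 + 1783), 20 + 18 = 38 DB transactions
console.log(planSync(new Array(3783).fill({})));
// { httpRequests: 2, dbTransactions: 38 }
```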
Example: Syncing 3783 schedules (optimized)

```
Client sends: 2 requests × ~2000 records each
      ↓
Each request: ~20 database batches × 100 records each
      ↓
Total: ~40 database transactions of 100 records
      ↓
Result: No timeouts, faster sync (~5-8 minutes vs 14-15 minutes)
```

Previous architecture (v3.0): 8 requests × 500 records

```
Client sends: 8 requests × ~500 records each
      ↓
Each request: 5 database batches × 100 records each
      ↓
Total: 40 database transactions of 100 records
      ↓
Result: Slower due to HTTP overhead (14-15 minutes)
```
This architecture ensures:
- ✅ No 504 Gateway Timeout errors
- ✅ Graceful handling of partial failures
- ✅ Idempotent upserts (safe to re-run)
- ✅ Detailed error logging
For more details, see `supabase/functions/README.md`.
- All credentials are stored securely in GitHub Secrets or local `.env` files
- The Supabase sync endpoint should be protected with API key authentication
- Never commit your `.env` file to version control
The scraper includes comprehensive tests to ensure all course patterns are correctly parsed:
```bash
npm test                                 # Run basic parser tests
npm run test:all                         # Run all parser tests including PE subject parsing
node tests/test-real-world-patterns.js   # Test with real AISIS patterns
```

The tests validate:
- Decimal course codes: `ENE 13.03i`, `ENGL 298.66`, `PEPC 13.03`
- Complex section codes: `WXY1`, `ST1A`, `PT-GRAD`, `THES/DISS1-8`
- 0-unit enrollment objects: `COMP`, `SUB-A`, `SUB-B`, `THES/DISS`, `YYY`, `ODEF`, `RESID`
- Special markers: `TBA (~)` for special enrollment courses
- Graduate courses: 200-300 level courses
- Lab sections: `LAB1-VW`, `LAB2-VW`
- PE department subjects: `PEPC 10`, `NSTP 11/CWTS`, `PHYED 100.20`
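For a feel of what "decimal course code" matching involves, here is an illustrative pattern (not the scraper's actual parser, which handles many more shapes such as slashed NSTP codes):

```javascript
// Match course codes like "ENE 13.03i", "ENGL 298.66", "PEPC 13.03":
// a 2-6 letter subject prefix, a 1-3 digit number, an optional decimal
// part, and an optional trailing lowercase letter.
const COURSE_CODE = /^[A-Z]{2,6}\s+\d{1,3}(\.\d{1,2})?[a-z]?$/;

console.log(COURSE_CODE.test('ENE 13.03i'));  // true
console.log(COURSE_CODE.test('ENGL 298.66')); // true
console.log(COURSE_CODE.test('PEPC 13.03'));  // true
console.log(COURSE_CODE.test('not a code'));  // false
```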
After scraping, validate subject distribution across departments:
```bash
npm run validate:subjects                               # Analyze data/courses.json
node src/validate-subjects.js data/custom-courses.json  # Custom path
```

This script:
- Computes per-department course counts
- Computes per-subject prefix breakdown (e.g., PEPC, NSTP, PHYED within PE department)
- Identifies missing subject families in critical departments
Example output:

```
📊 Per-Department Summary:
   PE ( 79 courses): NSTP=79, PEPC=0, PHYED=0

🔍 Critical Department Analysis:
   ⚠️ PE: PEPC courses missing (count = 0)
```
Baseline files are stored in `logs/baselines/baseline-{term}.json` and track:
- Total record count per term
- Per-department record counts
- Timestamp of scrape
- GitHub Actions metadata (if running in CI)
- Optional: Per-department subject prefix counts (when `TRACK_SUBJECT_PREFIXES=true`)
Important: Baseline files are local and not committed to git. To preserve baselines across GitHub Actions runs:
1. Upload baselines as artifacts:
   ```yaml
   - name: Upload baselines
     uses: actions/upload-artifact@v3
     with:
       name: baselines
       path: logs/baselines/
   ```

2. Download baselines before running the scraper:
   ```yaml
   - name: Download baselines
     uses: actions/download-artifact@v3
     with:
       name: baselines
       path: logs/baselines/
     continue-on-error: true  # Don't fail if no previous baseline
   ```
| Variable | Default | Description |
|---|---|---|
| **Authentication** | | |
| `AISIS_USERNAME` | - | Required: AISIS login username |
| `AISIS_PASSWORD` | - | Required: AISIS login password |
| **Data Sync** | | |
| `DATA_INGEST_TOKEN` | - | Supabase ingest endpoint token |
| `SUPABASE_URL` | - | Supabase project URL |
| `GOOGLE_SERVICE_ACCOUNT` | - | Base64-encoded service account JSON |
| `SPREADSHEET_ID` | - | Google Sheets spreadsheet ID |
| `SUPABASE_CLIENT_BATCH_SIZE` | `2000` | Records per HTTP request to Supabase |
| `CURRICULUM_SEND_GROUP_SIZE` | `10` | Programs grouped per HTTP request (1-50) |
| `CURRICULUM_SEND_CONCURRENCY` | `2` | Concurrent curriculum group sends (1-5) |
| **Term Configuration** | | |
| `AISIS_TERM` | Auto-detect | Override term code (e.g., `2025-1`) |
| `APPLICABLE_PERIOD` | Auto-detect | Legacy term override (use `AISIS_TERM` instead) |
| `AISIS_SCRAPE_MODE` | `current_next` | Scrape mode: `current`, `current_next`, `future`, `all`, or `year`. See MULTI_TERM_SCRAPING.md |
| **Schedule Scraper Performance** | | |
| `FAST_MODE` | `false` | Enable fast mode (skip validation, minimal delays) |
| `AISIS_CONCURRENCY` | `8` | Departments to scrape in parallel (1-20) |
| `AISIS_BATCH_DELAY_MS` | `500` | Delay between department batches (0-5000ms) |
| `AISIS_DEPARTMENTS` | All | Comma-separated list of departments to scrape |
| **Curriculum Scraper Performance** | | |
| `CURRICULUM_LIMIT` | All | Limit to first N curriculum programs |
| `CURRICULUM_SAMPLE` | All | Comma-separated list of specific degree codes |
| `CURRICULUM_DELAY_MS` | `1000` | Delay between curriculum requests (0-5000ms) - Balanced default |
| `CURRICULUM_CONCURRENCY` | `2` | Programs to scrape in parallel (1-10) - Balanced default |
| **Regression Detection** | | |
| `BASELINE_DROP_THRESHOLD` | `5.0` | Overall regression alert threshold (%) |
| `BASELINE_DEPT_DROP_THRESHOLD` | `0.5` | Per-department regression threshold (0.0-1.0 = 0%-100% drop) |
| `BASELINE_WARN_ONLY` | `true` | Warn only (don't fail job) on regression |
| `REQUIRE_BASELINES` | `true` | Fail job if baselines artifact is missing (prevents data loss). See docs/ingestion.md |
| `TRACK_SUBJECT_PREFIXES` | `false` | Track per-department subject prefix counts in baselines for regression detection |
| **Department Sanity Checks** | | |
| `SCRAPER_MIN_MA_MATH` | `50` | Minimum MATH courses required for MA (Mathematics) department |
| `SCRAPER_MIN_PE_COURSES` | `20` | Minimum total courses required for PE department |
| `SCRAPER_MIN_NSTP_COURSES` | `10` | Minimum NSTP courses required for NSTP departments |
| **Debugging** | | |
| `DEBUG_SCRAPER` | `false` | Enable detailed debug logging including subject prefix breakdowns |
The scraper includes comprehensive safeguards to prevent data loss from AISIS misrouting or HTML quirks. These protections were implemented after a critical incident where the MA (Mathematics) department returned only 13 Korean-language courses instead of 300+ MATH courses, and the scraper used `replace_existing=true` to wipe out all correct data.
The scraper performs automatic sanity checks on critical departments during scraping:
MA (Mathematics) Department:
- Counts courses with the `MATH` subject prefix
- Requires a minimum of 50 MATH courses (configurable via `SCRAPER_MIN_MA_MATH`)
- If the count is 0 or below the threshold, the scrape fails and the department is marked as failed
- Raw HTML is saved to `logs/` for debugging
PE (Physical Education) Department:
- Requires presence of `PEPC` and/or `PHYED` courses
- Enforces a minimum total course count (default: 20, configurable via `SCRAPER_MIN_PE_COURSES`)
- Detects when required subject prefixes are missing
NSTP Departments:
- Requires a minimum number of NSTP-prefixed courses (default: 10, configurable via `SCRAPER_MIN_NSTP_COURSES`)
- Applies to both `NSTP (ADAST)` and `NSTP (OSCI)` departments
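The MA check boils down to counting prefixed courses against a threshold. A minimal sketch, assuming courses carry a `subject_code` field (function name and course codes are illustrative):

```javascript
// Fail the MA department scrape if fewer than `minMath` courses carry
// the MATH subject prefix (threshold mirrors SCRAPER_MIN_MA_MATH).
function checkMathDepartment(courses, minMath = 50) {
  const mathCount = courses.filter((c) => c.subject_code.startsWith('MATH')).length;
  if (mathCount < minMath) {
    throw new Error(
      `MA sanity check failed: only ${mathCount} MATH courses (minimum ${minMath})`);
  }
  return mathCount;
}

// A misrouted response of 13 non-MATH courses (as in the incident) fails:
try {
  checkMathDepartment(new Array(13).fill({ subject_code: 'KRN 11' })); // hypothetical code
} catch (err) {
  console.log(err.message);
}
```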
When a sanity check fails:
- The department scrape throws an error
- The department is marked as `failed` in the scrape summary
- The department's courses are excluded from the Supabase sync
- The raw HTML response is saved to `logs/raw-sanity-check-failed-{term}-{dept}.html`
- Clear error messages are logged for debugging
The baseline system now tracks per-department statistics in addition to overall counts:
- Stores a baseline file: `logs/baselines/baseline-{term}-departments.json`
- Tracks for each department:
  - `row_count`: Number of courses
  - `prefix_breakdown`: Count of courses by subject prefix (e.g., `MATH=305`, `PEPC=79`)
- Detects regressions on a per-department basis before syncing to Supabase
- Configurable drop threshold (default: 50% via `BASELINE_DEPT_DROP_THRESHOLD`)
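The per-department comparison can be sketched with the fractional-threshold semantics described above (0.5 = a 50% drop triggers a regression; the helper is illustrative):

```javascript
// Compare per-department baseline counts against current counts and
// return the departments whose drop exceeds the threshold fraction.
function deptRegressions(baseline, current, threshold = 0.5) {
  const regressions = [];
  for (const [dept, prev] of Object.entries(baseline)) {
    const now = current[dept] ?? 0; // missing department counts as 0
    const drop = prev > 0 ? (prev - now) / prev : 0;
    if (drop > threshold) regressions.push({ dept, prev, now, drop });
  }
  return regressions;
}

// MA dropping from 305 to 13 courses is a ~96% drop -> flagged;
// PE is unchanged -> not flagged.
console.log(deptRegressions({ MA: 305, PE: 79 }, { MA: 13, PE: 79 }));
```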
Critical departments (MA, PE, NSTP) that fail regression checks will block `replace_existing=true` behavior, preventing destructive syncs.
Before syncing schedule data with `replace_existing=true`:
- Pre-sync health check validates all department data
- Compares current per-department counts against baselines
- Detects critical regressions (e.g., MA dropping from 305 to 13 courses)
- If critical departments fail the health check:
  - Sync is aborted to prevent data loss
  - A clear error message explains which departments failed
  - `replace_existing=true` is never sent to Supabase
  - Existing good data in the database is preserved
When sanity checks or health checks fail, the scraper automatically:
- Creates the `logs/` directory if it doesn't exist
- Saves the raw HTML response to a timestamped file
- Logs the file path for manual inspection and debugging
- Filename format: `logs/raw-{reason}-{term}-{dept}-{timestamp}.html`
This allows maintainers to inspect exactly what AISIS returned and diagnose the root cause.
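A sketch of the filename convention (the sanitization step is an assumption; colons in ISO timestamps are not filesystem-safe):

```javascript
// Build a debug-artifact path following the
// logs/raw-{reason}-{term}-{dept}-{timestamp}.html convention.
function rawHtmlPath(reason, term, dept, when = new Date()) {
  // Replace colons and dots so the timestamp is safe in a filename.
  const timestamp = when.toISOString().replace(/[:.]/g, '-');
  return `logs/raw-${reason}-${term}-${dept}-${timestamp}.html`;
}

console.log(rawHtmlPath('sanity-check-failed', '2025-1', 'MA',
  new Date('2025-01-15T10:30:00Z')));
// logs/raw-sanity-check-failed-2025-1-MA-2025-01-15T10-30-00-000Z.html
```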
All sanity check thresholds are configurable via environment variables:
```bash
# MA (Mathematics) department
SCRAPER_MIN_MA_MATH=50            # Minimum MATH courses

# PE (Physical Education) department
SCRAPER_MIN_PE_COURSES=20         # Minimum total courses

# NSTP departments
SCRAPER_MIN_NSTP_COURSES=10       # Minimum NSTP courses

# Per-department regression threshold
BASELINE_DEPT_DROP_THRESHOLD=0.5  # 50% drop triggers regression

# Enable verbose logging
DEBUG_SCRAPER=true
```

If you see sanity check failures:
- Check logs for error messages indicating which department failed and why
- Inspect the raw HTML saved to the `logs/` directory
- Verify AISIS is returning correct data via a web browser
- Adjust thresholds if legitimate changes occurred (new semester, course restructuring)
- Re-run scraper once AISIS issue is resolved
For more details on the baseline system and regression detection, see the existing documentation on `BASELINE_DROP_THRESHOLD` and `BASELINE_WARN_ONLY`.
This project is licensed under the MIT License.