Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 3 additions & 2 deletions .dockerignore
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
.git
.venv
venv
data
# data # REMOVED: Need data/tld_probs.json for 8-feature model
mlruns
mlartifacts
outputs
Expand Down Expand Up @@ -30,8 +30,9 @@ mlflow.db
.bandit
.flake8
pytest.ini
requirements*.txt
requirements.txt
requirements.in
# requirements-docker.txt is needed for Docker builds!

# Environment & Secrets
.env
Expand Down
54 changes: 33 additions & 21 deletions .env.example
Original file line number Diff line number Diff line change
@@ -1,35 +1,47 @@
# Judge backend: stub | llm
# ============================================================
# JUDGE CONFIGURATION
# ============================================================
JUDGE_BACKEND=stub

# LLM Judge (used when JUDGE_BACKEND=llm)
# LLM Judge (only used when JUDGE_BACKEND=llm)
OLLAMA_HOST=http://localhost:11434
JUDGE_MODEL=llama3.2:1b
JUDGE_TIMEOUT_SECS=12
# Optional: store models off C: to save space
# OLLAMA_MODELS=D:\ollama\models

# MongoDB Audit Logging (Optional)
MONGO_URI=
MONGO_DB=phishguard

# Thresholds (use the tuned URL-only thresholds by default)
THRESHOLDS_JSON=configs/dev/thresholds.json

# ============================================================
# MODEL SERVICE CONFIGURATION
# ============================================================

# Configuration file path (can be overridden for different environments)
# Configuration file path
CONFIG_PATH=configs/dev/config.yaml

# Primary model (production model used for decisions)
MODEL_PATH=models/dev/model_7feat.pkl
MODEL_META_PATH=models/dev/model_7feat_meta.json
# Primary model (8-feature production model)
PRIMARY_MODEL_PATH=models/dev/model_8feat.pkl
PRIMARY_META_PATH=models/dev/model_8feat_meta.json

# Shadow testing (DISABLED for production)
SHADOW_ENABLED=false
SHADOW_MODEL_PATH=models/dev/model_7feat.pkl
SHADOW_META_PATH=models/dev/model_7feat_meta.json

# Service URLs
MODEL_SVC_URL=http://localhost:9000
GATEWAY_PORT=8000
MODEL_SVC_PORT=9000

# Thresholds (gray-zone policy bands)
THRESHOLDS_JSON=configs/dev/thresholds.json

# ============================================================
# DATA & LOGGING
# ============================================================

# MongoDB Audit Logging (Optional - disabled by default)
MONGO_URI=
MONGO_DB=phishguard

# Shadow testing (A/B testing with 8-feature model)
SHADOW_ENABLED=true
SHADOW_MODEL_PATH=models/dev/model_8feat.pkl
SHADOW_META_PATH=models/dev/model_8feat_meta.json
# Logging
LOG_LEVEL=INFO

# Model service URL (used by gateway to call model service)
MODEL_SVC_URL=http://localhost:9000
# Optional: Ollama models storage (uncomment if needed)
# OLLAMA_MODELS=D:\ollama\models
15 changes: 9 additions & 6 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
# .github/workflows/ci.yml — Option A (paused CI for feature/dev)
# .github/workflows/ci.yml — CI/CD for 8-feature PhishGuard project
name: Tests
on:
pull_request:
branches: ["main"] # only runs on PRs into main
workflow_dispatch: # allow manual runs from Actions tab
branches: ["main"] # only runs on PRs into main
workflow_dispatch: # allow manual runs from Actions tab

jobs:
Tests:
Expand All @@ -13,17 +13,20 @@ jobs:
- uses: actions/setup-python@v5
with:
python-version: "3.11"
- run: |
- name: Install dependencies
run: |
python -m pip install --upgrade pip
# Skip pywin32 on Linux runners, install everything else
grep -v "pywin32" requirements.txt > requirements-linux.txt || cp requirements.txt requirements-linux.txt
pip install -r requirements-linux.txt
pip install pytest pytest-cov black isort flake8 mypy
- name: Set PYTHONPATH
run: echo "PYTHONPATH=$PYTHONPATH:$(pwd)/src:$(pwd)" >> $GITHUB_ENV
- run: |
- name: Code quality checks
run: |
black --check .
isort --check-only .
flake8 .
mypy src
- run: python -m pytest tests/ -q
- name: Run tests
run: python -m pytest tests/ -v --tb=short
23 changes: 17 additions & 6 deletions .github/workflows/data-contract.yml
Original file line number Diff line number Diff line change
Expand Up @@ -12,13 +12,24 @@ jobs:
- uses: actions/setup-python@v5
with:
python-version: "3.11"
- run: pip install -U pip pandas numpy
- name: Run data contract check (if file exists)
- run: pip install -U pip pandas numpy great-expectations
- name: Set PYTHONPATH
run: echo "PYTHONPATH=$PYTHONPATH:$(pwd)/src:$(pwd)" >> $GITHUB_ENV
- name: Run data contract check (8-feature model)
shell: bash
run: |
CSV="data/processed/phiusiil_clean_urlfeats.csv"
if [ -f "$CSV" ]; then
python scripts/ge_check.py --csv "$CSV"
# Check for 8-feature model data (current)
CSV_V2="data/processed/phiusiil_features_v2.csv"
# Legacy fallback
CSV_LEGACY="data/processed/phiusiil_clean_urlfeats.csv"

if [ -f "$CSV_V2" ]; then
echo "Found 8-feature model data: $CSV_V2"
python scripts/ge_check.py --csv "$CSV_V2"
elif [ -f "$CSV_LEGACY" ]; then
echo "Found legacy data: $CSV_LEGACY"
python scripts/ge_check.py --csv "$CSV_LEGACY"
else
echo "No processed CSV found ($CSV); skipping."
echo "No processed CSV found. Checked: $CSV_V2, $CSV_LEGACY"
echo "Skipping data contract validation."
fi
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@ venv/

# Python cache
__pycache__/
**/__pycache__/
*.py[cod]
*$py.class
*.so
Expand Down Expand Up @@ -43,4 +44,6 @@ outputs/*.csv

mlflow.db

docs/*.md


66 changes: 66 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,72 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.2.0] - 2025-10-15

### Added
- **8-Feature Model**: Complete upgrade to production-ready 8-feature phishing detection
- `IsHTTPS`: Binary HTTPS indicator for security baseline
- `TLDLegitimateProb`: Bayesian TLD legitimacy probability with 1401+ TLD dataset
- `CharContinuationRate`: Character repetition pattern detection
- `SpacialCharRatioInURL`: Special character density analysis
- `URLCharProb`: URL character sequence probability scoring
- `LetterRatioInURL`: Alphabetic character ratio for readability assessment
- `NoOfOtherSpecialCharsInURL`: Special character count for complexity analysis
- `DomainLength`: RFC-compliant domain length validation
- **Enhanced Judge System**: Modernized LLM integration with 8-feature model
- Updated judge contracts to use production features
- Enhanced stub logic with sophisticated heuristics
- Improved LLM prompts with detailed feature descriptions
- Graceful fallback from modern to legacy features
- **Comprehensive Test Suite**: 52 tests with 100% pass rate
- Updated all tests for 8-feature model compatibility
- Enhanced integration tests for microservice communication
- Modernized judge system tests with production features
- Fixed whitelist behavior validation

### Changed
- **Great Expectations**: Updated data validation for 8-feature model
- Migrated from 3-feature legacy validation to 8-feature production validation
- Updated thresholds and expectations for new feature ranges
- Enhanced data contract validation with feature-specific checks
- **Judge System Architecture**: Complete alignment with production features
- FeatureDigest contract updated with 8 required + 3 optional legacy fields
- Enhanced decision logic using modern feature signals
- Improved context and audit trail with comprehensive feature logging
- **GitHub Workflows**: Updated CI/CD for modern project structure
- Enhanced data contract workflow with 8-feature model support
- Updated CI workflow with better error reporting
- Added fallback logic for legacy data file compatibility

### Removed
- **Feature Service**: Eliminated redundant microservice
- Removed `src/feature_svc/` directory and related code
- Updated Docker compose to remove feature service dependency
- Cleaned up unused `docker/feature.Dockerfile`
- Streamlined architecture to gateway + model services only
- **Legacy Scripts**: Deprecated obsolete feature extraction
- Identified `scripts/materialize_url_features.py` as obsolete
- Removed references to deprecated 3-feature model components

### Fixed
- **Docker Configuration**: Enhanced for 8-feature model deployment
- Added `data/` directory to Docker images for TLD probability data
- Updated environment variables for proper service communication
- Fixed `.dockerignore` to include necessary data files
- **Test Infrastructure**: Resolved all compatibility issues
- Fixed whitelist behavior in integration tests
- Updated API contract expectations for current implementation
- Resolved version mismatches and dependency issues
- Enhanced test reliability with non-whitelisted test domains

### Technical Details
- **Feature Engineering**: Advanced URL-only features with statistical and linguistic analysis
- **Data Validation**: 31 comprehensive Great Expectations rules for production data quality
- **Performance**: Maintained 204ms API response time with enhanced feature extraction
- **Compatibility**: Backward compatibility maintained through optional legacy feature support

---

## [0.1.0] - 2025-09-17

### Added
Expand Down
Loading