feat: add MiniMax as LLM provider with Guardian threat protection #1
octo-patch wants to merge 13 commits into
Community edition (MIT framework + open-core model):
- 5-category threat detection (instructionOverride, jailbreakActivation, safetyBypass, roleHijacking, systemPromptLeaks)
- HMAC-SHA256 license key validator (PRO/ENT tiers)
- License-aware PatternAnalyzer and SemanticAnalyzer with 4-step asset resolution chain (assets_dir -> ~/.ethicore -> package)
- Async Guardian orchestrator with OpenAI, Anthropic, and Ollama providers
- ML inference engine with graceful heuristic fallback (no torch required)
- Full pytest suite: 100 passing, 61 skipped (require license + asset bundle)
- pyproject.toml build config; wheel verified clean (no proprietary assets)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
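The commit message names an HMAC-SHA256 license key validator but does not show it. A minimal sketch of how such a check could work follows; the secret, the `TIER-PAYLOAD-SIGNATURE` key format, and the function names are assumptions made for illustration, not the SDK's actual scheme.

```python
import hashlib
import hmac

# Hypothetical secret and key format; the SDK's real scheme is not shown in this PR.
SECRET = b"example-secret"
VALID_TIERS = {"PRO", "ENT"}


def sign(tier: str, payload: str, secret: bytes = SECRET) -> str:
    """Build a key of the assumed form TIER-PAYLOAD-SIGNATURE."""
    message = f"{tier}-{payload}".encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{tier}-{payload}-{digest}"


def validate(key: str, secret: bytes = SECRET):
    """Return the tier if the signature verifies, otherwise None."""
    parts = key.split("-")
    if len(parts) != 3:
        return None
    tier, payload, signature = parts
    if tier not in VALID_TIERS:
        return None
    message = f"{tier}-{payload}".encode("utf-8")
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the first mismatching position through timing.
    if hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return tier
    return None
```

Comparing with `hmac.compare_digest` rather than `==` is the usual choice for signature checks, since it runs in time independent of where the strings differ.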
…framing
Transforms the minimal technical reference into a trust-building document that leads with the problem (prompt injection shipped without a real defense), backs it with technical proof (4-layer pipeline, ONNX offline inference, ~15ms p99 latency), and makes the ethical conviction concrete through the Guardian Covenant framework reference.
Key changes:
- Hero: 'Only' positioning statement + founding insight sentence
- Added 'See It Work' section with attack demo in 4 lines
- Added 'Why Offline Inference Matters' section (key differentiator vs cloud APIs)
- Expanded Community vs Licensed comparison table (30 rows, all categories named)
- Fixed category count: 30 (was incorrectly stated as 25+ in old README)
- Added Guardian Covenant framework reference with link placeholder
- Added Community & Discussions section to activate GitHub Discussions
- Closing line: the conviction sentence that anchors the brand
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…users
Previous version incorrectly framed Guardian SDK as protecting users from AI. Corrected to the accurate frame: Guardian protects the developer's AI system (and its integrity, data, and designed behavior) from adversarial attackers using prompt injection, jailbreaks, and role hijacking.
Key changes:
- Hero: focuses on real-time threat detection and blocking before model context
- Rewrote opening to frame the attack surface and the defender (the developer)
- Added 'What It Defends Against' section listing specific attack vectors
- Guardian Covenant reframed: developer's commitment to defend what they build
- Closing: 'You built something that people rely on. Defend it.'
- Removed all language implying users are protected FROM the AI
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ncoding
Critical bug fixes discovered and resolved during full-suite audit:
ONNX / ML layer (ml_inference_engine.py, guardian.py):
- MLInferenceEngine now accepts assets_dir param; SimpleOrchestrator passes it correctly (guardian-model.onnx was never being located)
- Added ONNX Runtime inference path with calibration gate; falls back to heuristics if model outputs >0.4 avg probability on benign inputs (model requires retraining before contributing to scoring)
- Fixed unnormalized linguistic features: raw len(text) -> min(1.0, len/500), preventing ONNX input saturation
Licensed pattern library (threat_patterns.py):
- Community stub now performs license-aware dynamic loading at import time
- `from ethicore_guardian.data.threat_patterns import THREAT_PATTERNS` transparently returns 30-category licensed library when ETHICORE_LICENSE_KEY + asset bundle are present; falls back to community 5-category stub otherwise
- Fixes: 53 licensed-tier test failures now pass (157/157 total)
Semantic layer (semantic_analyzer.py):
- vocab.json + special_tokens.json confirmed in asset bundle and package data; semantic analyzer uses full 30,522-token vocabulary
Scoring and display (guardian.py, threat_detector.py):
- Layer votes display was always showing BLOCK regardless of actual analysis result; fixed to threshold-based per-layer decisions
- Cleaned up SimpleOrchestrator analyzer wiring
Windows / encoding (all analyzer files, __init__.py):
- Replaced emoji in print() calls with ASCII equivalents across all modules to prevent UnicodeEncodeError on cp1252 terminals
Test suite:
- 96/96 community tests pass (no license key required)
- 157/157 licensed tests pass (with ETHICORE_LICENSE_KEY)
- 15/15 attack detection in live demo (100%)
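The length normalization and the calibration gate described in this commit can be sketched in a few lines. The 500-character scale and the 0.4 threshold come from the commit message; the function names and the choice to treat an empty calibration set as a failure are assumptions.

```python
def normalized_length(text: str, scale: int = 500) -> float:
    """Map raw length into [0, 1] so long inputs cannot saturate the model input."""
    return min(1.0, len(text) / scale)


def passes_calibration(benign_probs, threshold: float = 0.4) -> bool:
    """Accept the model only if its average threat probability on benign text is low."""
    if not benign_probs:
        # Assumption: with no calibration evidence, treat the model as uncalibrated.
        return False
    return sum(benign_probs) / len(benign_probs) <= threshold


def choose_scorer(benign_probs) -> str:
    """Use the ONNX model when it passes the gate, otherwise fall back to heuristics."""
    return "onnx" if passes_calibration(benign_probs) else "heuristic"
```

Running the gate at load time means a badly trained model degrades to the heuristic path instead of blocking benign traffic.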
…tegories
Licensed tier (threat_patterns_licensed.py):
- Expanded from 241 patterns / 30 categories → 500 patterns / 51 categories
- Added 21 new attack-vector categories sourced from OWASP LLM Top 10 2025,
MITRE ATLAS, Anthropic red-team research, Garak probe taxonomy, and the
PLINY / social-media jailbreak community (X/Twitter, Reddit r/jailbreak):
crescendoAttack, manyShotJailbreaking, cipherObfuscation,
authorityImpersonation, sandboxExemption, delimiterInjection,
outputFormatEscape, persistentPersona, contextWindowFlooding,
falsePermissionClaim, legalJurisdictionBypass, professionalAuthorityBypass,
researchExemption, reversePsychology, contrastiveExtraction,
metaInstructionAttack, memorySeedingAttack, adversarialFormatting,
goalHijackingChain, negationBypass, plinyStyleJailbreak
- Bulked up 3 thin existing categories (trainingDataExtraction,
emotionalManipulation, multiTurnSetup) to 10-12 patterns each
- Bulked up 7 underpowered categories (harmfulContentGeneration,
piiHarvesting, commandInjection, dataExfiltration, urgencyExploit, etc.)
- Semantic fingerprints: 234 → 444 across all 51 categories
- Synced live asset to ~/.ethicore/data/threat_patterns_licensed.py
Scripts:
- scripts/regenerate_embeddings.py: rewrote to be license-aware; resolves
output path via CLI args > ETHICORE_ASSETS_DIR > ~/.ethicore > package;
reads licensed fingerprints (444) when ETHICORE_LICENSE_KEY is set
- scripts/retrain_guardian_model.py: new script — generates synthetic
training data from fingerprint library, trains sklearn MLPClassifier,
exports to ONNX with correct dense_1_input/dense_4 interface, runs
calibration gate before writing (Principle 14: Divine Safety)
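The output-path precedence described for regenerate_embeddings.py (CLI args > ETHICORE_ASSETS_DIR > ~/.ethicore > package) could look roughly like the following. Only the environment variable name and the ordering come from the commit message; the function name, its parameters, and the default package path are hypothetical.

```python
import os
from pathlib import Path


def resolve_assets_dir(cli_dir=None, env=None, home=None,
                       package_dir="ethicore_guardian/data") -> Path:
    """Return the first available location: CLI arg, env var, ~/.ethicore, package."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    if cli_dir:
        return Path(cli_dir)

    env_dir = env.get("ETHICORE_ASSETS_DIR")
    if env_dir:
        return Path(env_dir)

    user_dir = home / ".ethicore"
    if user_dir.is_dir():
        return user_dir

    # Assumed fallback: a data directory shipped inside the package.
    return Path(package_dir)
```

Passing `env` and `home` as parameters keeps the function testable without touching the real environment or home directory.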
Community patterns (threat_patterns.py):
- Added 6 gap-fix patterns to instructionOverride, safetyBypassAttempt,
roleHijacking to close real-world jailbreak gaps
Tests:
- test_phase4_threat_library.py: updated hard count assertions (30→≥51,
234→≥444, 235→≥500); added v1.2.0 category coverage checks for all 21
new categories; switched to >= comparisons for forward compatibility
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
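The switch from exact counts to `>=` comparisons in the test suite can be illustrated as below. The minimums are the figures quoted in the commit message; the helper name and its signature are hypothetical and do not reproduce test_phase4_threat_library.py.

```python
# Lower bounds quoted in the commit message above.
MIN_CATEGORIES = 51
MIN_FINGERPRINTS = 444
MIN_PATTERNS = 500


def check_library_size(categories: int, fingerprints: int, patterns: int) -> None:
    """Assert lower bounds so that adding patterns later does not break the test."""
    assert categories >= MIN_CATEGORIES, f"expected >= {MIN_CATEGORIES} categories"
    assert fingerprints >= MIN_FINGERPRINTS, f"expected >= {MIN_FINGERPRINTS} fingerprints"
    assert patterns >= MIN_PATTERNS, f"expected >= {MIN_PATTERNS} patterns"
```

A lower bound still catches accidental loss of patterns while letting the library grow without touching the tests.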
retrain_guardian_model.py:
- Full rewrite: 20 000 samples (10k threat / 10k benign) with real MiniLM semantic embeddings on every training sample; eliminates the feature-starvation bug where only 6/127 features were non-zero
- 600+ benign templates across 6 domains; 75 assistant-phrasing phrases oversampled 3x to prevent false positives on 'how can I help' inputs
- 45 hard-negative security-research sentences (labeled BENIGN) prevent flagging legitimate AI-safety research
- Fixed self-check: now compares sklearn vs ONNX probabilities using the calibration cal_X vectors rather than [0.01]*27 placeholder embeddings
- Fixed ONNX Gather axis bug: explicit 'probabilities' string match prevents matching 'label [N]' (1D) instead of 'probabilities [N,2]'
- Fixed Windows console Unicode errors: replaced all checkmark/arrow symbols with ASCII [OK]/[FAIL]/[WARN]
- Corrected docstring sample/architecture defaults (20000, 128,64)
.gitignore:
- Added fix_*.py, *.bak, *.tmp to suppress scratch/temp files
Trained model stats (v1.2.0 release):
Samples: 20 000 | Accuracy: 99.83% | AUC-ROC: 0.9996
Calibration: avg benign prob 0.0337 (all 3 texts < 0.11) -- PASSED
Output: guardian-model.onnx 98 KB (gitignored; in asset bundle)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
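The 'probabilities' output fix amounts to selecting the model output by its exact name instead of by position or a loose match. A minimal sketch, assuming the output names are available as a list of strings ('label' and 'probabilities' are the names quoted in the commit message; the helper itself is hypothetical):

```python
def pick_probability_output(output_names) -> int:
    """Return the index of the output named exactly 'probabilities'."""
    for index, name in enumerate(output_names):
        if name == "probabilities":
            return index
    raise ValueError(f"no 'probabilities' output among {list(output_names)}")
```

Raising when the name is missing is deliberate: silently falling back to the first output would reintroduce the bug of reading the 1D label tensor as if it were the [N, 2] probability tensor.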
…res()
Root cause: _build_feature_vector() used text-derived behavioral proxy
features (char_len/500, word_count/100, ...) while extract_features()
uses sentinel defaults ([0.5, 1.0, 0.0, 0.0, ...]) when no behavioral
session data is available. Features [0] and [1] differed by +0.44 and
+0.93 respectively — the model learned those high sentinel values as
threat-correlated, causing avg_benign_prob=0.990 at MLInferenceEngine
load-time calibration and the engine falling back to heuristics.
Fixes applied to retrain_guardian_model.py:
1. _build_feature_vector() now mirrors extract_features() exactly:
- Behavioral [0:40]: sentinel defaults [0.5, 1.0, 0.0, 0.0, ...]
- Linguistic [40:75]: same 5 text-derived computations as engine
- Technical [75:100]: sentinel defaults [0.1, 0.0, ...]
- Semantic [100:127]: real MiniLM or [0.01]*27 null placeholder
Verified: 0 feature mismatches across all test texts.
2. Null-semantic injection: 20% of training samples use [0.01]*27 for
the semantic slot, teaching the model that null semantic signal
does not imply threat (handles calibration + SemanticAnalyzer
unavailable edge cases gracefully).
3. Calibration gate in retrain script now uses [0.01]*27 (matching
extract_features exactly) instead of hash-based fallback — ensures
the gate that passes here is the same gate MLInferenceEngine runs.
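The feature layout and the null-semantic injection from fixes 1 and 2 can be sketched as follows. The slice boundaries, the leading sentinel values, the [0.01]*27 placeholder, and the 20% rate are taken from the commit message. Everything else is an assumption: the elided sentinel entries are padded with zeros, only one of the five linguistic features is computed, and the function names are hypothetical.

```python
import random

# Slice sizes implied by the commit message: [0:40], [40:75], [75:100], [100:127].
BEHAVIORAL, LINGUISTIC, TECHNICAL, SEMANTIC = 40, 35, 25, 27
NULL_SEMANTIC = [0.01] * SEMANTIC


def build_feature_vector(text: str, semantic=None) -> list:
    """Assemble a 127-dim vector using sentinel defaults where no session data exists."""
    # Assumption: entries elided as "..." in the commit message are zeros.
    behavioral = [0.5, 1.0] + [0.0] * (BEHAVIORAL - 2)
    # Only the normalized length is shown; the other text-derived features are omitted.
    linguistic = [min(1.0, len(text) / 500)] + [0.0] * (LINGUISTIC - 1)
    technical = [0.1] + [0.0] * (TECHNICAL - 1)
    semantic = list(NULL_SEMANTIC) if semantic is None else list(semantic)
    if len(semantic) != SEMANTIC:
        raise ValueError(f"semantic slot must have {SEMANTIC} values")
    return behavioral + linguistic + technical + semantic


def maybe_null_semantic(embedding, rng: random.Random, null_rate: float = 0.2):
    """Replace the embedding with the null placeholder for a fraction of samples."""
    return list(NULL_SEMANTIC) if rng.random() < null_rate else embedding
```

Sharing one builder between training and inference is the simplest way to keep the two feature pipelines from drifting apart again.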
Result (v1.2.0 final model — hash a9433737b58720c8):
Accuracy: 95.47% AUC-ROC: 0.9935
Calibration: avg benign prob 0.020 (all 3 texts < 0.03) -- PASSED
Engine load: guardian-model-onnx calibration passed (avg: 0.020)
Smoke test: 3 benign ALLOW, 3 threats BLOCK (ML votes 73-99%)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…omment
_SECRET_MASKED has been populated since v1.2.0 retrain; the "all zeros" warning comment was leftover setup boilerplate and was factually incorrect. Removed to prevent confusion for any future reviewer of the source.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove ethicore_guardian/analyzers/ and guardian.py from git tracking (core IP; distributed via paid asset bundle only, not open source)
- Gitignore analyzers/, guardian.py, and proprietary test files to prevent accidental future commits
- Update README: 51 categories / 500+ patterns / 444+ fingerprints
- Add bi-directional six-layer pipeline docs (pre-flight + post-flight)
- Update Community vs Licensed table with post-flight and learning rows
- Update GuardianConfig reference with Phase 3 parameters
History scrub required: analyzers/ and guardian.py exist in earlier commits and must be purged with git-filter-repo before history is clean.
Phase 3 release — OutputAnalyzer (post-flight gate) + AdversarialLearner (closed-loop learning). Adds analyze_response() public API and GuardianConfig output/learning fields.
- Add 80-entry _BENIGN_CASUAL domain: short greetings, informal requests,
phrases containing 'benign'/'normal' — the exact inputs that produced
false positives (ML BLOCK on single-word greetings, 'Hello, help me...')
- Over-sample _BENIGN_CASUAL 4x in _ALL_BENIGN pool; add dedicated 20%
sampling path in _make_benign_sample() (previously 0% representation)
- Increase default --samples from 20,000 to 30,000 (15k threat / 15k benign)
to ensure all 51 licensed categories are thoroughly represented
- Adjust _make_benign_sample() weights:
20% casual (new), 20% assistant phrasing, 15% hard negatives, 45% pool
- Update calibration gate hint and docstring to reference 30,000
Licensed retrain results (51 categories / 444 fingerprints):
Accuracy: 95.33% | AUC-ROC: 0.9940 | Avg benign prob: 0.0139
Calibration gate: PASSED
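The adjusted sampling weights for _make_benign_sample() (20% casual, 20% assistant phrasing, 15% hard negatives, 45% pool) correspond to a weighted choice over four sources. A minimal sketch, with hypothetical source labels and function name; the actual script's pools and structure are not shown in this PR:

```python
import random

# Weights from the commit message; the source labels are placeholders.
SOURCES = ["casual", "assistant_phrasing", "hard_negative", "pool"]
WEIGHTS = [0.20, 0.20, 0.15, 0.45]


def pick_benign_source(rng: random.Random) -> str:
    """Pick which benign pool the next training sample is drawn from."""
    return rng.choices(SOURCES, weights=WEIGHTS, k=1)[0]
```

Passing a seeded `random.Random` makes the generated training set reproducible between retrains.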
Add MiniMax (https://www.minimax.io) as a first-class provider integration for Guardian SDK. MiniMax offers powerful LLM models (M2.7, M2.5) through an OpenAI-compatible API, and this provider wraps MiniMax-configured OpenAI clients with the same threat detection pipeline used for OpenAI and Anthropic.
Changes:
- Add minimax_provider.py with MiniMaxProvider, ProtectedMiniMaxClient, and create_protected_minimax_client() convenience factory
- Add MiniMax auto-detection in get_provider_for_client() via base_url
- Add minimax optional dependency group in pyproject.toml
- Add MiniMax provider example and install instructions to README
- Add 30 tests (22 unit + 5 integration + 3 constant tests)
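Auto-detection via base_url can be sketched as a case-insensitive substring check on the client's configured endpoint. The helper name below is hypothetical and this is not the code from get_provider_for_client(); the value is passed through `str()` on the assumption that `base_url` may be a URL object instead of a plain string.

```python
from types import SimpleNamespace


def is_minimax_client(client) -> bool:
    """Treat a client as MiniMax when its base_url mentions 'minimax'."""
    base_url = getattr(client, "base_url", None)
    if base_url is None:
        return False
    return "minimax" in str(base_url).lower()


# Stand-ins for configured OpenAI-compatible clients, used only for illustration.
minimax_like = SimpleNamespace(base_url="https://api.minimax.io/v1")
openai_like = SimpleNamespace(base_url="https://api.openai.com/v1")
```

A substring match is permissive: any endpoint whose URL contains "minimax" is accepted, so the check is best kept behind a prior test that the client is an OpenAI-style client at all.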
Hey @octo-patch, really appreciate you taking the time to contribute to the Guardian SDK! It's clear you read the codebase carefully: the proxy structure, async/sync handling, and docstrings all follow the existing patterns closely, and the mock-based test suite is on par. Great first PR.
Before we merge, there are two things that need to be addressed:
- guardian.py was not updated
- Fail-open on empty prompt (minimax_provider.py, line 283)
Two smaller things worth a follow-up (not blocking):
- The base_url string match works here since it's nested inside the OpenAI module check, but a comment explaining the reasoning would help future maintainers.
P.S. Sorry this took a little while to get back to you!
Summary
Add MiniMax as a first-class provider integration for Guardian SDK. MiniMax offers powerful LLM models (M2.7, M2.5) through an OpenAI-compatible API at https://api.minimax.io/v1. This provider wraps MiniMax-configured OpenAI clients with the same multi-layer threat detection pipeline used for OpenAI and Anthropic.
Changes
- ethicore_guardian/providers/minimax_provider.py: MiniMaxProvider, ProtectedMiniMaxClient, ProtectedChat, ProtectedCompletions, and create_protected_minimax_client() convenience factory
- get_provider_for_client() in base_provider.py to detect MiniMax clients by checking base_url for 'minimax'
- minimax optional dependency group in pyproject.toml (uses openai>=1.0.0)
- tests/test_minimax.py: 22 unit tests + 5 integration tests + 3 constant tests
Supported Models
Usage
Test plan