ML-assisted signals for scaling, cost, and Kubernetes operations
Quick start · Features · Configuration · Contributing · License
CloudPilot connects Prometheus metrics, Kubernetes, and AWS pricing data to a small set of Python modules that recommend scaling actions, surface cost-oriented hints, tune deployment CPU limits, and flag anomalies. It is built for operators and engineers who want transparent defaults, testable behavior, and explicit guardrails when automation touches production clusters.
flowchart LR
subgraph signals [Data sources]
PR[Prometheus]
K8[Kubernetes]
AW[AWS Pricing]
end
subgraph core [CloudPilot]
CP[Heuristics and ML]
end
subgraph out [Outcomes]
SC[Scaling hints]
CO[Cost hints]
TU[Auto-tuning]
AN[Anomaly signals]
end
PR --> CP
K8 --> CP
AW --> CP
CP --> SC
CP --> CO
CP --> TU
CP --> AN
| Section | Contents |
|---|---|
| Quick start | Clone, environment, and first install |
| Features | What the toolkit does |
| Tech stack | Languages, libraries, and CI |
| Requirements | What you need before running |
| Project layout | Repository map |
| Installation | Extras, uv, and Locust |
| Configuration | Environment variables |
| Usage | CLI and Locust |
| Machine learning artifacts | Models and training |
| AWS and Kubernetes notes | Integration details |
| Testing and quality | Pytest, coverage, audits |
| Roadmap | Planned direction |
| Contributing | How to help |
| License | Legal |
git clone https://github.com/<your-org-or-username>/cloudpilot.git
cd cloudpilot
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -e ".[dev,ml]" --extra-index-url https://download.pytorch.org/whl/cpu
pytest
cloudpilot --version
Runtime-only install (includes optional PyTorch for scripted scaling): pip install -e ".[ml]".
| Capability | What you get |
|---|---|
| Scaling intelligence | TorchScript inference when a model is available; otherwise a safe, deterministic fallback. |
| Cost awareness | EC2 pricing lookups via the AWS Price List API, returned as concise guidance. |
| Kubernetes tuning | Heuristic CPU limit adjustments with an optional dry-run that skips API patches. |
| Anomaly detection | Isolation Forest over metric features; model training is lazy (not at import time). |
| Self-healing | Pod restarts only when explicitly confirmed through configuration—never by default. |
| Load simulation | Poisson-style request timing and a stress-test placeholder for experimentation. |
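As an illustration of the "Poisson-style request timing" row, request timestamps for a Poisson process can be generated by summing exponentially distributed gaps. This is a minimal sketch, not the code in cloudpilot/load_tester.py; the function name `poisson_arrival_times` and its parameters are hypothetical.

```python
import random
from typing import List, Optional


def poisson_arrival_times(
    rate_per_sec: float, duration_sec: float, seed: Optional[int] = None
) -> List[float]:
    """Return sorted request timestamps (seconds) within [0, duration_sec).

    Hypothetical helper: gaps between requests are drawn from an exponential
    distribution with mean 1 / rate_per_sec, which yields a Poisson process.
    """
    if rate_per_sec <= 0 or duration_sec <= 0:
        return []
    rng = random.Random(seed)  # a seeded generator keeps runs reproducible
    times: List[float] = []
    t = rng.expovariate(rate_per_sec)
    while t < duration_sec:
        times.append(t)
        t += rng.expovariate(rate_per_sec)
    return times
```

A load generator would sleep until each timestamp before sending the next request; passing a fixed seed makes a test scenario repeatable.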
| Layer | Details |
|---|---|
| Runtime | Python 3.10+ (pyproject.toml) |
| Machine learning | scikit-learn; PyTorch via optional ml extra |
| Cloud & orchestration | boto3, official Kubernetes client |
| Observability | prometheus-api-client |
| Optional load tests | Locust (locustfile.py) |
| Quality gates | Ruff, Mypy, Bandit, pip-audit, pytest, coverage |
| Continuous integration | GitHub Actions on 3.10, 3.11, 3.12 (.github/workflows/ci.yml) |
- Python 3.10 or newer.
- AWS (for pricing): credentials or role available to boto3 (for example standard environment variables or instance metadata).
- Kubernetes (for live tuning or self-heal): a valid kubeconfig and cluster reachability—omit if you only run the test suite with mocks.
.
├── cloudpilot/ # Main package (PEP 561: py.typed)
│ ├── config.py # Central env-based settings
│ ├── scaling.py
│ ├── cost_optimizer.py
│ ├── k8s_autotuner.py
│ ├── anomaly_detector.py
│ ├── load_tester.py
│ └── training_rl_scaler.py
├── tests/
├── cli.py # Same entry as console script `cloudpilot`
├── locustfile.py
├── pyproject.toml
├── uv.lock
├── .github/workflows/ci.yml
├── CONTRIBUTING.md
├── LICENSE
└── README.md
1. Clone (use your fork or upstream URL).
git clone https://github.com/<your-org-or-username>/cloudpilot.git
cd cloudpilot
2. Virtual environment (recommended).
python -m venv .venv
source .venv/bin/activate # Linux / macOS
# .venv\Scripts\activate.bat # Windows cmd
# .venv\Scripts\Activate.ps1 # Windows PowerShell
3. Install one of the following.
| Goal | Command |
|---|---|
| Application + ML extra | pip install -e ".[ml]" |
| Full developer + ML (matches CI toolset) | pip install -e ".[dev,ml]" --extra-index-url https://download.pytorch.org/whl/cpu |
The CPU PyTorch index keeps wheels smaller on Linux, macOS, and typical CI images. For CUDA builds, drop the extra index and install the wheel set that matches your platform.
Reproducible installs with uv
uv sync --all-extras
The first resolve may pull a large PyTorch artifact when ml is included.
Optional: Locust
pip install locust
Dependency extras (declared under [project.optional-dependencies] in pyproject.toml):
| Extra | Includes |
|---|---|
| ml | torch>=2.0 for scripted scaling |
| dev | pytest, coverage, Ruff, Mypy, Bandit, pip-audit, types-PyYAML |
Combine with: pip install -e ".[dev,ml]".
Note. requirements.txt documents install patterns only; it does not pin versions. Prefer pyproject.toml and, when using uv, uv.lock.
All settings are read from the environment. Source of truth: cloudpilot/config.py.
| Variable | Default | Role |
|---|---|---|
| CLOUDPILOT_PROMETHEUS_URL | http://localhost:9090 | Prometheus base URL |
| CLOUDPILOT_PROMETHEUS_DISABLE_SSL | 1 (truthy) | Skip TLS verification for Prometheus |
| CLOUDPILOT_SELF_HEAL_CONFIRM | unset | Must be 1, true, yes, or on to allow destructive pod deletes in self_heal |
| CLOUDPILOT_AWS_PRICING_REGION | us-east-1 | Region for the Pricing API client |
| CLOUDPILOT_K8S_DRY_RUN | unset | If truthy, tuning runs without patching the cluster |
Safety. Pod deletion is opt-in by design. Without CLOUDPILOT_SELF_HEAL_CONFIRM, self-heal reports a skip instead of mutating the cluster.
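The variables and defaults above could be read along the following lines. This is a sketch under stated assumptions, not the contents of cloudpilot/config.py: the names `Settings`, `env_flag`, and `load_settings` are hypothetical, and it assumes every boolean flag accepts the same case-insensitive values (1, true, yes, on) that the table lists for CLOUDPILOT_SELF_HEAL_CONFIRM.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: all flags share the truthy values documented for self-heal.
TRUTHY = frozenset({"1", "true", "yes", "on"})


def env_flag(env: Mapping[str, str], name: str, default: bool = False) -> bool:
    """Interpret an environment value as a boolean; unset falls back to default."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


@dataclass(frozen=True)
class Settings:  # hypothetical container, for illustration only
    prometheus_url: str
    prometheus_disable_ssl: bool
    self_heal_confirm: bool
    aws_pricing_region: str
    k8s_dry_run: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build settings from the environment, using the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        prometheus_url=env.get("CLOUDPILOT_PROMETHEUS_URL", "http://localhost:9090"),
        prometheus_disable_ssl=env_flag(env, "CLOUDPILOT_PROMETHEUS_DISABLE_SSL", True),
        self_heal_confirm=env_flag(env, "CLOUDPILOT_SELF_HEAL_CONFIRM", False),
        aws_pricing_region=env.get("CLOUDPILOT_AWS_PRICING_REGION", "us-east-1"),
        k8s_dry_run=env_flag(env, "CLOUDPILOT_K8S_DRY_RUN", False),
    )
```

Accepting a mapping instead of reading os.environ directly keeps the parser easy to test: `load_settings({})` returns the defaults without touching the real environment.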
The cloudpilot command (or python cli.py) exposes:
| Action | Example |
|---|---|
| Scaling recommendation | cloudpilot scale --cpu 80 --mem 70 --req 0.8 --latency 100 --demand 0.9 |
| Cost hint | cloudpilot cost --instance-type m5.large |
| Deployment tuning | cloudpilot tune --deployment your-deployment --namespace default |
| Version | cloudpilot --version |
For scale, --demand must lie in [0, 1].
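One way to enforce the [0, 1] constraint on --demand at parse time is a custom argument type. The sketch below assumes an argparse-based CLI, which the source does not confirm; `unit_interval` and `build_parser` are hypothetical names, and only the flags shown in the scale example are modeled.

```python
import argparse


def unit_interval(text: str) -> float:
    """argparse type: accept only numbers in the closed interval [0, 1]."""
    try:
        value = float(text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"not a number: {text!r}")
    # NaN fails this comparison too, so it is rejected along with out-of-range values.
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError(f"must lie in [0, 1], got {value}")
    return value


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags of the `scale` example."""
    parser = argparse.ArgumentParser(prog="cloudpilot-sketch")
    sub = parser.add_subparsers(dest="command", required=True)
    scale = sub.add_parser("scale")
    scale.add_argument("--cpu", type=float, required=True)
    scale.add_argument("--mem", type=float, required=True)
    scale.add_argument("--req", type=float, required=True)
    scale.add_argument("--latency", type=float, required=True)
    scale.add_argument("--demand", type=unit_interval, required=True)
    return parser
```

With this approach an out-of-range value such as `--demand 1.5` is rejected by the parser with a usage error before any scaling logic runs.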
locust -f locustfile.py
Then open the Locust UI in your browser to control the scenario.
- Inference: With torch installed, CloudPilot searches for rl_scaling_model.pt as packaged data under cloudpilot/, then on disk beside the package. A missing file or missing torch yields a stable heuristic outcome (Maintain).
- Training a placeholder model: With the ml extra, python -m cloudpilot.training_rl_scaler writes rl_scaling_model.pt in the working directory. Package or mount that file where your runtime expects it.
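The lookup-then-fallback behavior described above can be sketched as follows. This is an illustration under assumptions, not the code in cloudpilot/scaling.py: `find_model` and `recommend` are hypothetical names, the action labels other than Maintain are invented for the example, and the model is assumed to return one score per action for a single feature row.

```python
from pathlib import Path
from typing import Iterable, Optional, Sequence

MODEL_NAME = "rl_scaling_model.pt"
FALLBACK_ACTION = "Maintain"
# Assumption: label order matches the model's output; only "Maintain" is from the docs.
ACTIONS = ("Scale down", "Maintain", "Scale up")


def find_model(search_dirs: Iterable[str]) -> Optional[Path]:
    """Return the first existing model file among the candidate directories."""
    for directory in search_dirs:
        candidate = Path(directory) / MODEL_NAME
        if candidate.is_file():
            return candidate
    return None


def recommend(features: Sequence[float], search_dirs: Iterable[str]) -> str:
    """Use TorchScript inference when possible, else the deterministic fallback."""
    model_path = find_model(search_dirs)
    if model_path is None:
        return FALLBACK_ACTION
    try:
        import torch  # optional dependency from the ml extra
    except ImportError:
        return FALLBACK_ACTION
    model = torch.jit.load(str(model_path))
    model.eval()
    with torch.no_grad():
        scores = model(torch.tensor([list(features)], dtype=torch.float32))
    return ACTIONS[int(scores.argmax(dim=1).item())]
```

Importing torch lazily inside the function keeps the module importable on installs without the ml extra, so the fallback path never depends on PyTorch being present.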
- AWS: Pricing filters target common Linux / shared-tenancy / regional product rows. Extend or change filters in code if you need other operating systems or commercial terms.
- Kubernetes: The client uses default kubeconfig discovery. Use CLOUDPILOT_K8S_DRY_RUN to exercise tuning logic without applying patch_namespaced_deployment.
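To make the dry-run behavior concrete, the sketch below computes a proposed CPU limit and only calls the patch function when dry-run is off. The heuristic (observed usage plus a percentage of headroom, with a floor) and the names `propose_cpu_limit` and `tune_cpu_limit` are hypothetical and are not taken from cloudpilot/k8s_autotuner.py. In live use, `patch_fn` would be the `patch_namespaced_deployment` method of the Kubernetes client's `AppsV1Api`.

```python
from typing import Any, Callable, Dict


def propose_cpu_limit(
    usage_millicores: int, headroom_pct: int = 130, floor_millicores: int = 100
) -> int:
    """Hypothetical heuristic: usage scaled by headroom, rounded up, never below a floor."""
    scaled = -(-usage_millicores * headroom_pct // 100)  # integer ceiling division
    return max(floor_millicores, scaled)


def tune_cpu_limit(
    deployment: str,
    namespace: str,
    container: str,
    usage_millicores: int,
    dry_run: bool,
    patch_fn: Callable[..., Any],
) -> Dict[str, Any]:
    """Build a patch body for one container and apply it unless dry_run is set."""
    limit = propose_cpu_limit(usage_millicores)
    body = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": container, "resources": {"limits": {"cpu": f"{limit}m"}}}
                    ]
                }
            }
        }
    }
    if dry_run:
        # Report what would change without touching the cluster.
        return {"applied": False, "body": body}
    patch_fn(name=deployment, namespace=namespace, body=body)
    return {"applied": True, "body": body}
```

Injecting `patch_fn` rather than constructing a client inside the function lets tests pass a stub and assert that dry-run mode never reaches the API.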
Default pytest options exclude @pytest.mark.integration tests (see pyproject.toml).
pytest
CI-style run (coverage + JUnit):
pytest --junitxml=junit.xml -q --cov=cloudpilot --cov=cli --cov-report=xml --cov-report=term
Coverage enforces fail_under = 45 when reporting is enabled.
pytest -m integration # only integration-marked tests
py -m pytest # Windows launcher if `pytest` is not on PATH
Security and supply chain (also executed in CI):
bandit -r cloudpilot -c pyproject.toml
pip freeze > freeze.txt && pip-audit -r freeze.txt --desc on && rm freeze.txt
freeze.txt is gitignored; do not commit it.
- RL training grounded in real workload history.
- Broader cloud pricing (GCP, Azure).
- Stronger anomaly models (for example sequence models or autoencoders).
- Operator-focused dashboard.
- Deeper integration with industrial load and stress tools.
Guidelines, hooks, and review expectations live in CONTRIBUTING.md. Issues and pull requests are welcome.
Released under the MIT License.
Copyright (c) 2025 Matéo H. Petel.