CloudSentinel
Multi-cloud chaos engineering platform for Kubernetes.
CloudSentinel is a production-grade chaos engineering platform that lets you:
- :material-server-network: Provision Kubernetes clusters (Kind, EKS, AKS, GKE) through a single Python CLI
- :material-fire: Inject real faults into any node (CPU stress, memory pressure, network latency, disk fill)
- :material-chart-line: Collect node-level metrics through a lightweight Go agent running as a DaemonSet
- :material-monitor-dashboard: Drive scenarios from a REST API, a web dashboard, or the command line
- :material-robot: Detect anomalies in collected metrics with an Isolation Forest model
v0.2.0 highlight
All four fault types are now fully implemented and validated end-to-end on a real Kind cluster. See the v0.2.0 release notes for the full changelog.
Try it in three commands
The make demo target spins up a local Kind cluster, builds and loads the agent image, deploys the DaemonSet, and exposes the agent's gRPC + Prometheus endpoints on localhost. About three minutes from clone to a working chaos engineering setup.
For the full hands-on guide with copy-paste commands for every fault type, head to the Playground.
Components
| Component | Language | Role |
|---|---|---|
| Orchestrator | Python + Poetry | CLI that provisions clusters via Terraform and runs chaos scenarios |
| Agent | Go 1.24 | DaemonSet pod exposing Prometheus metrics + gRPC fault injection (4 real fault types) |
| API | Python (FastAPI + SQLModel) | REST API persisting scenarios in SQLite |
| Dashboard | TypeScript (Next.js + Tailwind) | Web UI to list, create, and run scenarios |
| ML Pipeline | Python (scikit-learn) | Isolation Forest anomaly detection, F1 = 0.98 |
What's verified
Every fault implementation is tested with measurements that prove the behavior, not just check return codes:
| Fault | Verified by | Result |
|---|---|---|
| CPU Stress | syscall.Getrusage |
1.5 CPU-seconds per 1 wall-second with 2 workers at 80% intensity |
| Memory Pressure | runtime.MemStats |
100 MB requested → 100 MB allocated → 0 MB after Stop+GC |
| Disk Fill | os.Stat |
10 MiB requested → 10485760 bytes exact, fsync verified |
| Network Latency | Mock TCExecutor | Idempotent Stop, no leaked qdiscs on Start failure |
35 unit tests pass in approximately 2 seconds. End-to-end validated on Kind v0.23.0.
Quick links
- :material-rocket-launch: Getting Started — install prerequisites and run the agent
- :material-gamepad-variant: Playground — copy-paste commands for every fault type
- :material-blueprint: Architecture — Strategy Pattern, Registry, Mock TCExecutor
- :material-brain: ML Pipeline — Isolation Forest training and inference
Roadmap
| Niveau | Status | Highlights |
|---|---|---|
| Niveau 1 — Real fault injection | :white_check_mark: shipped (v0.2.0) | 4 fault types, Strategy Pattern, robust rollback, make demo |
| Niveau 2 — Scenarios as Code | :hourglass: in design | Declarative YAML with Prometheus assertions |
| Niveau 3 — LLM-assisted | :hourglass: planned | Auto-generate hypotheses from cluster topology |
| Niveau 4 — Helm + CRD | :hourglass: planned | GitOps-friendly deployment |
| Niveau 5 — CNCF Sandbox | :hourglass: planned | Multi-region, KubeCon talk, comparison page |
Why CloudSentinel?
Cloud-native systems fail in surprising ways. Chaos engineering is the practice of deliberately injecting those failures in controlled conditions so your team can observe and fix them before they happen in production.
Tools like Chaos Mesh and LitmusChaos exist, but I wanted to understand the internals: how do you actually consume CPU on a node? How do you allocate memory that the kernel does not swap? How do you guarantee rollback when the agent crashes?
CloudSentinel answers those questions with code you can read, tests that prove the behavior, and a single make demo to try it on your laptop.