🎮 Playground
Try CloudSentinel in three commands on your local machine. Everything below is real, not a simulation — these commands create a working Kubernetes cluster, deploy the agent, and inject genuine faults that consume CPU, memory, and disk, and add real network latency.
Prerequisites
You need Docker, kind ≥ 0.20, kubectl ≥ 1.28, and Go 1.24+ installed locally. Run `make check-prereqs` to verify your setup.
⚡ Quick start
Clone the repository, change into it, and run the demo target:
```bash
git clone https://github.com/mamoudou-cheikh-kane/cloudsentinel
cd cloudsentinel
make demo
```
The make demo target performs six steps automatically:
- Verifies that `docker`, `kind`, and `kubectl` are installed
- Builds the agent Docker image (`cloudsentinel-agent:0.2.0`)
- Creates a local Kind cluster named `cloudsentinel-demo`
- Loads the freshly built image into the Kind node
- Applies the DaemonSet manifest and waits for rollout
- Port-forwards `:50051` (gRPC) and `:9100` (Prometheus metrics)
After about three minutes, you have a working chaos engineering setup on your laptop.
💥 Inject your first fault
Once `make demo` is running and the agent pod is Ready, open another terminal and inject a real CPU stress fault:
```bash
grpcurl -plaintext -d '{
  "type": "FAULT_TYPE_CPU_STRESS",
  "duration_seconds": 30,
  "parameters": {"intensity": "80", "workers": "4"}
}' localhost:50051 cloudsentinel.agent.v1.AgentService/InjectFault
```
The agent spawns 4 worker goroutines that each hold roughly 80% CPU for 30 seconds, then cleans up automatically. The unit tests verify the consumption via `syscall.Getrusage`: with 2 workers at 80% intensity, the process burns about 1.5 CPU-seconds per wall-second.
```bash
grpcurl -plaintext -d '{
  "type": "FAULT_TYPE_MEMORY_PRESSURE",
  "duration_seconds": 30,
  "parameters": {"size_mb": "200"}
}' localhost:50051 cloudsentinel.agent.v1.AgentService/InjectFault
```
Allocates 200 MB and touches every 4 KiB page every 200 ms to keep the memory resident. The unit tests verify via `runtime.MemStats` that the requested amount is fully allocated and drops back to 0 MB after Stop plus a GC cycle.
```bash
grpcurl -plaintext -d '{
  "type": "FAULT_TYPE_DISK_FILL",
  "duration_seconds": 60,
  "parameters": {"size_mb": "100"}
}' localhost:50051 cloudsentinel.agent.v1.AgentService/InjectFault
```
Writes 100 MiB in 1 MiB chunks, calling `fsync` after each chunk so the bytes reach durable storage. The unit tests verify via `os.Stat` that the file size matches the request exactly (100 MiB → 104,857,600 bytes).
```bash
grpcurl -plaintext -d '{
  "type": "FAULT_TYPE_NETWORK_LATENCY",
  "duration_seconds": 30,
  "parameters": {"delay_ms": "200", "jitter_ms": "20", "interface": "eth0"}
}' localhost:50051 cloudsentinel.agent.v1.AgentService/InjectFault
```
Adds 200 ms ± 20 ms of latency to the node's `eth0` interface using `tc` and `netem`. The fault is removed cleanly on Stop, and the init container cleans orphan rules at pod startup.
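A sketch of the `tc`/`netem` command lines such a fault drives, built as `exec.Cmd` values the way a TCExecutor-style abstraction might; the helper names are assumptions, only the `tc qdisc … netem delay` syntax itself is standard:

```go
package main

import (
	"fmt"
	"os/exec"
)

// netemAdd builds the command that attaches a netem qdisc adding
// delayMs ± jitterMs of latency to iface, e.g.:
//   tc qdisc add dev eth0 root netem delay 200ms 20ms
func netemAdd(iface string, delayMs, jitterMs int) *exec.Cmd {
	return exec.Command("tc", "qdisc", "add", "dev", iface, "root", "netem",
		"delay", fmt.Sprintf("%dms", delayMs), fmt.Sprintf("%dms", jitterMs))
}

// netemDel builds the matching cleanup command, which is what Stop
// (and an orphan-rule sweep at startup) would run.
func netemDel(iface string) *exec.Cmd {
	return exec.Command("tc", "qdisc", "del", "dev", iface, "root", "netem")
}

func main() {
	// Print the command lines instead of running them, since tc needs
	// root and a real interface.
	fmt.Println(netemAdd("eth0", 200, 20).Args)
	fmt.Println(netemDel("eth0").Args)
}
```

Keeping the command construction behind an interface is also what makes the mock-executor unit tests in the table below possible.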
📺 Watch the demo
A short video walkthrough showing make demo from clone to fault injection is coming soon. In the meantime, you can follow the steps above on your own machine.
🔍 Verify it really works
While a fault is running, watch the agent's logs in real time:
```bash
kubectl -n cloudsentinel logs -l app.kubernetes.io/name=cloudsentinel-agent -f
```
You should see structured JSON logs like:
```json
{"time":"2026-04-27T21:11:40Z","level":"INFO","msg":"fault registered","fault_id":"daef1638-9e31-44fc-ab3a-c1e8c6b232d3","type":"cpu_stress","duration":60000000000}
{"time":"2026-04-27T21:11:40Z","level":"INFO","msg":"fault injected","fault_id":"daef1638-9e31-44fc-ab3a-c1e8c6b232d3","type":"cpu_stress","duration_seconds":60}
```
The auto-cleanup happens at exactly the configured duration, regardless of whether the gRPC client is still connected.
📊 What's verified
Every fault implementation is tested with measurements that prove the behavior, not just check return codes:
| Fault | Verified by | Result |
|---|---|---|
| CPU Stress | `syscall.Getrusage` measures consumed CPU-seconds | 1.5 CPU-seconds per 1 wall-second with 2 workers at 80% |
| Memory Pressure | `runtime.MemStats` measures heap allocation | 100 MB requested → 100 MB allocated → 0 MB after Stop+GC |
| Disk Fill | `os.Stat` measures file size on disk | 10 MiB requested → 10485760 bytes exact, fsync verified |
| Network Latency | Mock TCExecutor verifies `tc` invocation | Idempotent Stop, no leaked qdiscs on Start failure |
35 unit tests pass in approximately 2 seconds. End-to-end validated on Kind v0.23.0 with WSL2 Ubuntu 22.04.
🏗️ What's happening under the hood
When you call `InjectFault` on the gRPC API, here is what happens inside the agent:

- The gRPC server validates the request (type, duration, parameters)
- A `Fault` object is constructed by the right factory (`NewCPUStress`, `NewMemoryPressure`, etc.)
- The Registry assigns a UUID, stores the fault, and starts a `time.AfterFunc` timer
- `Fault.Start(ctx)` is called: worker goroutines spawn, memory is allocated, a file is created, or `tc` rules are added
- The gRPC response returns immediately with the assigned `fault_id`
- After the configured duration, the timer fires and `Fault.Stop(ctx)` runs the cleanup
- Optionally, the client can call `Rollback(fault_id)` early to stop the fault before its timer fires
Each `Fault` implementation is a separate Go file in `internal/faults/`. They share the same interface (`ID`, `Type`, `Start`, `Stop`, `StartedAt`, `Duration`), so adding a new fault type only requires writing the new file plus one line in the `buildFault` dispatcher.
🛠️ Clean up
When you are done, tear down the demo cluster:
```bash
make demo-clean
```
This deletes the Kind cluster and frees up the local Docker resources. Your machine is back to its previous state.
🚀 Next steps
- Read the Architecture page for the full design overview
- Browse the source code on GitHub
- Check the Roadmap for what is coming in Level 2