Building infrastructure you own, understand, and can’t break.

Cilium Deep Dive: eBPF Networking for Kubernetes
Kubernetes networking is notoriously complex. CNI plugins, kube-proxy, iptables chains, service meshes — layers upon layers of abstraction that eventually break in ways nobody understands.

Cilium changes this. It uses eBPF to move networking logic into the Linux kernel, bypassing iptables entirely. The result: better performance, more visibility, and network policies that actually make sense. This is what I run in my clusters. Let me show you why.

What is eBPF?
eBPF (extended Berkeley Packet Filter) lets you run sandboxed programs in the Linux kernel without changing kernel source code or loading kernel modules. ...
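To make the mechanism concrete, here is a minimal sketch of loading a pre-compiled eBPF program and attaching it to a network interface from Go, using the cilium/ebpf library (the same library Cilium itself builds on). The object file, program name, and interface are hypothetical placeholders; this illustrates the eBPF loading mechanism, not Cilium's actual datapath code.

```go
// Minimal sketch: load a compiled eBPF object and attach it at XDP.
// Requires root (or CAP_BPF) and an object built with clang -target bpf.
// "drop_counter.o" and "xdp_prog" are hypothetical names.
package main

import (
	"log"
	"net"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

func main() {
	// Load the compiled eBPF object into the kernel.
	coll, err := ebpf.LoadCollection("drop_counter.o")
	if err != nil {
		log.Fatalf("loading collection: %v", err)
	}
	defer coll.Close()

	prog := coll.Programs["xdp_prog"]
	if prog == nil {
		log.Fatal("program xdp_prog not found in object")
	}

	iface, err := net.InterfaceByName("eth0")
	if err != nil {
		log.Fatal(err)
	}

	// Attach at XDP: the program now sees every packet on eth0
	// before the kernel networking stack, with no iptables involved.
	l, err := link.AttachXDP(link.XDPOptions{
		Program:   prog,
		Interface: iface.Index,
	})
	if err != nil {
		log.Fatalf("attaching XDP: %v", err)
	}
	defer l.Close()

	select {} // keep the program attached until killed
}
```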

Distributed Tracing with Tempo and OpenTelemetry
You have metrics telling you something is slow. You have logs telling you errors happened. But which request failed? Where did the latency come from? Which service in the chain caused the timeout?

This is where distributed tracing comes in. It follows individual requests as they flow through your microservices, showing you exactly what happened and where.

The Observability Triangle
[Diagram: Metrics (Prometheus/Thanos) show WHAT is happening, Logs (Loki) show WHY it happened, and Traces (Tempo) show WHERE it happened; the three interconnect, and Grafana queries each of them.]

Metrics answer: “What is the error rate? What is the latency?”
Logs answer: “What error message? What was the context?”
Traces answer: “Which service? Which call? What was the path?”

Together, they give you complete understanding. ...
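For a concrete taste of what emitting a trace looks like, here is a minimal sketch using the OpenTelemetry Go SDK, exporting spans over OTLP/HTTP to a Tempo-compatible endpoint. The service name, span names, and endpoint are assumptions for illustration; adjust them for your cluster.

```go
// Minimal sketch: create parent and child spans and ship them via
// OTLP/HTTP. Endpoint "tempo.monitoring:4318" is an assumed address.
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export spans over OTLP/HTTP to a collector or Tempo endpoint.
	exp, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint("tempo.monitoring:4318"),
		otlptracehttp.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)

	tracer := otel.Tracer("checkout-service")

	// Each span records WHERE time was spent; parent/child links
	// reconstruct the request path across services.
	ctx, span := tracer.Start(ctx, "handle-order")
	defer span.End()

	_, child := tracer.Start(ctx, "charge-card")
	// ... call the payment service here ...
	child.End()
}
```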

Loki for Kubernetes Logging: The Prometheus-Like Approach
You’ve got Prometheus for metrics. You can see what’s happening across your clusters. But when something breaks, metrics tell you that something is wrong — logs tell you why. The traditional answer is Elasticsearch. It’s powerful, flexible, and… expensive. It indexes everything, which means you pay for every byte of log data in CPU, memory, and storage. Loki takes a different approach: index labels, not content. It’s the same philosophy that makes Prometheus efficient for metrics, applied to logs. ...
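The “index labels, not content” model is easy to see in a toy sketch: labels select streams through a small index, and only the selected streams’ content is scanned, grep-style. Everything below (types, names, sample data) is illustrative, not Loki’s actual code.

```go
// Toy sketch of Loki's two-phase query model: a tiny label index
// prunes streams, then matching streams are brute-force scanned.
package main

import (
	"fmt"
	"strings"
)

type Stream struct {
	Labels map[string]string // indexed: small, fixed-cardinality
	Lines  []string          // not indexed: stored compressed, scanned on demand
}

// matches reports whether a stream carries all requested labels,
// the only part of the query answered from the index.
func matches(s Stream, sel map[string]string) bool {
	for k, v := range sel {
		if s.Labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	streams := []Stream{
		{Labels: map[string]string{"app": "api", "namespace": "prod"},
			Lines: []string{"GET /healthz 200", "POST /orders 500 timeout"}},
		{Labels: map[string]string{"app": "worker", "namespace": "prod"},
			Lines: []string{"job done in 12ms"}},
	}

	// Roughly the LogQL query: {app="api", namespace="prod"} |= "500"
	sel := map[string]string{"app": "api", "namespace": "prod"}
	for _, s := range streams {
		if !matches(s, sel) {
			continue // index pruned this stream; its content is never read
		}
		for _, line := range s.Lines {
			if strings.Contains(line, "500") { // content filter: a scan, like grep
				fmt.Println(line)
			}
		}
	}
}
```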

Thanos Remote Write: Push-Based Metrics for Edge and Multi-Cluster
In my previous post on Prometheus and Thanos, I covered the sidecar architecture — Thanos Sidecar runs alongside Prometheus, uploads TSDB blocks to object storage, and exposes data to the Querier. It works beautifully for clusters with stable connectivity to your central infrastructure. But what happens when your clusters are at the edge? When they might lose connectivity for hours or days? When you’re running dozens or hundreds of small clusters and don’t want sidecar complexity on each one? ...
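The post hinges on the push model, so here is a toy sketch of what remote write does conceptually: samples queue locally (Prometheus uses its write-ahead log for this) and a sender retries with backoff until the central receiver is reachable again, which is why edge clusters tolerate hours of lost connectivity. The payload here is deliberately simplified; real remote write sends snappy-compressed protobuf, and the Thanos Receive endpoint shown is an assumption to adapt for your setup.

```go
// Toy sketch of push-based metrics delivery with retry and backoff.
// Not the real remote-write wire format; illustration only.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

type sample struct {
	Metric string
	Value  float64
	TS     time.Time
}

func push(endpoint string, batch []sample) error {
	// Real remote write sends snappy-compressed protobuf; a plain
	// text body keeps the sketch short.
	body := new(bytes.Buffer)
	for _, s := range batch {
		fmt.Fprintf(body, "%s %f %d\n", s.Metric, s.Value, s.TS.UnixMilli())
	}
	resp, err := http.Post(endpoint, "text/plain", body)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("receiver returned %s", resp.Status)
	}
	return nil
}

func main() {
	queue := []sample{{"up", 1, time.Now()}} // stand-in for the WAL

	// Retry with capped exponential backoff: nothing is dropped
	// while the link to the central receiver is down.
	backoff := time.Second
	for {
		err := push("http://thanos-receive:19291/api/v1/receive", queue)
		if err == nil {
			break
		}
		time.Sleep(backoff)
		if backoff < time.Minute {
			backoff *= 2
		}
	}
}
```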

Declarative Infrastructure as Compliance Documentation: Talos, NixOS, and Audit-Ready Systems
Compliance audits are painful. Anyone who’s been through ISO 27001 certification knows the drill: weeks of documentation gathering, screenshots of configurations, evidence of change management processes, proof that what you say you do is what you actually do. But here’s something I’ve realized after running declarative infrastructure for years: systems like Talos and NixOS don’t just make infrastructure better — they make compliance dramatically easier. The same properties that make these systems reliable (immutability, reproducibility, auditability) are exactly what auditors want to see. ...