“Autonomous cluster management” — the promise that an AI can monitor your Kubernetes cluster, diagnose problems, and perhaps even fix them without human intervention. It sounds like the holy grail for platform engineers.
The reality is more nuanced.
In this post I test K8sGPT with a locally running Llama 3.3 70B model on Apple Silicon. No cloud APIs, no data leaving your network, fully sovereign. Is this usable for real cluster diagnosis? Let’s find out.
Disclaimer: This is a homelab experiment. I’m describing what I tested and what I found. This is not a recommendation to run this in production — quite the opposite, as the security analysis will show.
Hardware and Software Stack
The Hardware
- Mac Studio M3 Ultra with 512GB unified memory
- 80 GPU cores, all available for inference
- Unified memory means no copying between CPU and GPU RAM
This isn’t a cheap setup (~€10,000), but few consumer machines can run a 70B model at Q8 quantization with acceptable speed.
The Software Stack
| Component | Version | Role |
|---|---|---|
| vLLM | 0.6.x | Inference server with Metal backend |
| Llama 3.3 70B | Q8_0 | The language model (~75GB) |
| K8sGPT Operator | 0.1.x | Kubernetes operator for diagnosis |
| k3s | 1.29.x | Local Kubernetes cluster |
Installation: vLLM with Metal Backend
vLLM has experimental Metal support for Apple Silicon. Installation:
# Create a dedicated conda environment
conda create -n vllm python=3.11
conda activate vllm
# Install vLLM with Metal support
pip install vllm
# Verify Metal backend
python -c "import vllm; print(vllm.__version__)"
Note: At the time of writing, Metal support in vLLM is still experimental. For production-like workloads, llama.cpp with its Metal backend is more stable, and its built-in server also exposes the OpenAI-compatible API that K8sGPT expects.
Model Download
# Download the model (Q8 quantization, ~75GB)
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct \
--local-dir ./models/llama-3.3-70b-instruct
You need a Hugging Face account and must accept the Llama license.
Starting the vLLM Server
# Start the inference server
vllm serve ./models/llama-3.3-70b-instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype float16 \
--max-model-len 8192 \
--device mps
The --device mps flag forces Metal Performance Shaders. Without this flag, vLLM falls back to CPU.
Verify the server is running:
curl http://localhost:8000/v1/models
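Beyond listing models, you can fire a test chat completion at the OpenAI-compatible endpoint. A minimal stdlib-only sketch (the base URL and model name match the setup above; the actual send is commented out so the script also runs without a live server):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # keep smoke tests as deterministic as possible
    }

def send_chat_request(base_url: str, body: dict) -> dict:
    """POST the request to the inference server and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    body = build_chat_request("llama-3.3-70b-instruct", "Say 'pong'.")
    print(json.dumps(body, indent=2))
    # With the vLLM server from the previous step running:
    # print(send_chat_request("http://localhost:8000/v1", body))
```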
K8sGPT Operator Deployment
Install the K8sGPT operator in your cluster:
# Add the Helm repo
helm repo add k8sgpt https://charts.k8sgpt.ai/
helm repo update
# Install the operator
helm install k8sgpt-operator k8sgpt/k8sgpt-operator \
-n k8sgpt-operator-system \
--create-namespace
Configure a custom backend pointing to your local vLLM server:
# k8sgpt-config.yaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
name: k8sgpt-local
namespace: k8sgpt-operator-system
spec:
ai:
enabled: true
model: llama-3.3-70b-instruct
backend: localai
baseUrl: http://192.168.1.100:8000/v1 # IP of your Mac Studio
noCache: false
version: v0.3.40
analyzers:
- Pod
- Deployment
- Service
- ReplicaSet
- PersistentVolumeClaim
- Ingress
- StatefulSet
- CronJob
kubectl apply -f k8sgpt-config.yaml
Test Scenarios
Now that the setup is running, time to test whether it’s actually useful.
Scenario A: Failing Pod Diagnosis
I introduce a deployment with a missing ConfigMap:
# broken-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: broken-app
namespace: default
spec:
replicas: 1
selector:
matchLabels:
app: broken-app
template:
metadata:
labels:
app: broken-app
spec:
containers:
- name: app
image: nginx:1.25
envFrom:
- configMapRef:
name: app-config # This ConfigMap doesn't exist
kubectl apply -f broken-deployment.yaml
After a minute, the pod is stuck in CreateContainerConfigError (the container never starts, so it never actually reaches CrashLoopBackOff). K8sGPT analysis:
kubectl get results -n k8sgpt-operator-system -o yaml
Output (paraphrased):
Analysis: Pod broken-app-xxxx is in CreateContainerConfigError state.
The pod is referencing a ConfigMap 'app-config' that does not exist
in the namespace.
Suggested remediation:
1. Create the missing ConfigMap:
kubectl create configmap app-config --from-literal=KEY=value
2. Or remove the configMapRef from the deployment spec
3. Verify the ConfigMap name spelling matches exactly
Score: Usable
The diagnosis is correct and the suggestions are practical. A junior engineer could work with this. What’s missing: it doesn’t suggest first checking if the ConfigMap might be in a different namespace, which is a common mistake.
Scenario B: Resource Recommendations
A deployment without resource limits:
# no-limits.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: no-limits-app
namespace: default
spec:
replicas: 3
selector:
matchLabels:
app: no-limits
template:
metadata:
labels:
app: no-limits
spec:
containers:
- name: app
image: nginx:1.25
# No resources defined
K8sGPT analysis:
Analysis: Deployment no-limits-app has containers without resource
requests or limits defined.
Issues identified:
- Container 'app' has no CPU requests/limits
- Container 'app' has no memory requests/limits
Suggested remediation:
Add resource specifications to ensure predictable scheduling and
prevent resource starvation:
resources:
requests:
memory: "64Mi"
cpu: "100m"
limits:
memory: "128Mi"
cpu: "200m"
Score: Partially usable
It correctly identifies the problem, but the suggested values are generic and not based on actual consumption. An experienced engineer would first collect metrics with kubectl top pods or Prometheus data, then rightsize.
What’s missing:
- No suggestion to use VPA (Vertical Pod Autoscaler)
- No warning that limits without requests can be problematic
- The requests:limits ratio (1:2) is arbitrary
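The rightsizing step an engineer would do by hand can be sketched in a few lines: take observed usage from kubectl top or Prometheus, pick a high percentile, add headroom. The p95 percentile, 20% headroom, and 1:1.5 request:limit ratio here are my own illustrative choices, not anything K8sGPT suggests:

```python
def rightsize(samples_mib: list[float], headroom: float = 1.2,
              limit_ratio: float = 1.5) -> dict:
    """Derive a memory request/limit from observed usage samples (MiB).

    Request = p95 of observed usage plus headroom; limit = request * ratio.
    The defaults are illustrative, tune them per workload.
    """
    ordered = sorted(samples_mib)
    p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
    request = int(p95 * headroom)
    return {"request_mib": request, "limit_mib": int(request * limit_ratio)}

# Example with five hypothetical kubectl-top samples:
print(rightsize([48, 52, 55, 60, 90]))  # → {'request_mib': 108, 'limit_mib': 162}
```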
Scenario C: Security Audit
A pod with poor security practices:
# insecure-pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: insecure-pod
namespace: default
spec:
hostNetwork: true
containers:
- name: app
image: nginx:1.25
securityContext:
privileged: true
runAsUser: 0  # run as root (there is no runAsRoot field in securityContext)
volumeMounts:
- name: host-root
mountPath: /host
volumes:
- name: host-root
hostPath:
path: /
K8sGPT analysis with Trivy integration:
Analysis: Pod insecure-pod has multiple security issues:
Critical:
- Container running in privileged mode (privileged: true)
- Container has access to host network (hostNetwork: true)
- Container mounts host root filesystem
High:
- No securityContext.readOnlyRootFilesystem
- No securityContext.allowPrivilegeEscalation: false
Suggested remediation:
1. Remove privileged: true unless absolutely necessary
2. Use a dedicated network namespace (remove hostNetwork: true)
3. Avoid mounting host paths, especially root filesystem
4. Add security hardening:
securityContext:
privileged: false
readOnlyRootFilesystem: true
allowPrivilegeEscalation: false
runAsNonRoot: true
capabilities:
drop:
- ALL
Score: Usable
This is surprisingly good. The issues are correctly identified and the remediation is what a security engineer would advise. The Trivy integration adds value by also checking image vulnerabilities.
What’s missing:
- No suggestion for Pod Security Standards (restricted profile)
- No warning about Kyverno/OPA policies that should block this
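For completeness: a Kyverno ClusterPolicy along these lines would have rejected the pod at admission, before any LLM ever saw it. This is a sketch of the well-known disallow-privileged pattern, validate it against your Kyverno version:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
spec:
  validationFailureAction: Enforce
  rules:
  - name: deny-privileged-containers
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: "Privileged containers and hostNetwork are not allowed."
      pattern:
        spec:
          =(hostNetwork): false
          containers:
          - =(securityContext):
              =(privileged): false
```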
Performance Metrics
Inference Speed
| Metric | Value |
|---|---|
| Tokens/second (prompt) | ~180 t/s |
| Tokens/second (generation) | ~25 t/s |
| First token latency | ~2.5s |
| Typical analysis (500 tokens out) | ~22s |
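The typical-analysis row follows directly from the other rows: wait for the first token, then stream the remaining output at the generation rate:

```python
def analysis_latency_s(first_token_s: float, out_tokens: int, gen_tps: float) -> float:
    """End-to-end generation time: first-token latency plus streaming time."""
    return first_token_s + out_tokens / gen_tps

# Values from the table above:
print(round(analysis_latency_s(2.5, 500, 25.0), 1))  # → 22.5
```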
End-to-End Latency
From issue detection to report in K8sGPT:
| Phase | Time |
|---|---|
| Issue detection (polling) | 30s (configurable) |
| Context gathering | ~2s |
| LLM inference | ~20-30s |
| Result storage | <1s |
| Total | ~55s |
Resource Usage
During inference:
| Resource | Usage |
|---|---|
| GPU Memory (Metal) | ~78GB |
| CPU | ~15% (data preprocessing) |
| System Memory | ~12GB (excluding the model) |
| Power draw | ~180W |
Comparison with OpenAI API
| Metric | Local (70B) | OpenAI GPT-4 |
|---|---|---|
| Latency | ~25s | ~5s |
| Quality | Good | Very good |
| Cost | €0 (after hardware) | ~€0.03/query |
| Privacy | Fully local | Data to OpenAI |
The OpenAI API is faster and the output is marginally better, but your data leaves your network.
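A rough break-even calculation for the cost row (ignoring power draw and per-token pricing details; €10,000 of hardware against ~€0.03 per API query):

```python
def breakeven_queries(hardware_eur: float, per_query_eur: float) -> int:
    """Number of API queries after which local hardware has paid for itself."""
    return round(hardware_eur / per_query_eur)

print(breakeven_queries(10_000, 0.03))  # → 333333
```

At a few dozen analyses per day, that is decades of usage, so the hardware only pays off if it serves other workloads too.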
Air-Gapped Deployment
Can this setup work without an internet connection? Yes, with preparation.
What You Need to Download Beforehand
# 1. Model weights (~75GB)
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct \
--local-dir ./airgap-bundle/models/
# 2. vLLM Python packages
pip download vllm -d ./airgap-bundle/packages/
# 3. K8sGPT container images
docker pull ghcr.io/k8sgpt-ai/k8sgpt-operator:latest
docker save ghcr.io/k8sgpt-ai/k8sgpt-operator:latest > ./airgap-bundle/images/k8sgpt-operator.tar
# 4. Helm charts
helm pull k8sgpt/k8sgpt-operator --destination ./airgap-bundle/charts/
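Before transporting the bundle, record SHA-256 checksums so tampering in transit is detectable on the air-gapped side. A sketch, assuming the bundle layout from the commands above:

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 (model weights are too big to read at once)."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def write_manifest(bundle_dir: str, manifest_name: str = "SHA256SUMS") -> Path:
    """Write '<digest>  <relative path>' lines for every file in the bundle."""
    bundle = Path(bundle_dir)
    lines = [
        f"{sha256_file(p)}  {p.relative_to(bundle)}"
        for p in sorted(bundle.rglob("*"))
        if p.is_file() and p.name != manifest_name
    ]
    manifest = bundle / manifest_name
    manifest.write_text("\n".join(lines) + "\n")
    return manifest

# Verify on the other side with: shasum -a 256 -c SHA256SUMS (macOS)
# or: sha256sum -c SHA256SUMS (Linux)
```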
Transport and Installation
# On the air-gapped machine:
# 1. Install Python packages offline
pip install --no-index --find-links=./airgap-bundle/packages/ vllm
# 2. Load container images
docker load < ./airgap-bundle/images/k8sgpt-operator.tar
# Or push to your local registry
# 3. Install Helm chart
helm install k8sgpt-operator ./airgap-bundle/charts/k8sgpt-operator-*.tgz \
--set image.repository=your-local-registry/k8sgpt-operator
Air-Gap Friendly Components
| Component | Air-Gap Ready | Notes |
|---|---|---|
| vLLM | Yes | No phone-home |
| Llama model | Yes | One-time download |
| K8sGPT Operator | Yes | No telemetry |
| Trivy DB | No | Requires periodic updates |
Note: The Trivy vulnerability database needs to be updated and transported separately. Without a recent DB, K8sGPT will miss new CVEs.
Security Analysis and Threat Model
This is where it gets interesting. Let’s be honest about the risks.
Platform Security Issues
A Mac Studio as inference server has fundamental limitations:
| Issue | Impact |
|---|---|
| No TPM | No hardware attestation, no measured boot |
| macOS is general-purpose | Not hardened like RHEL/Ubuntu with CIS benchmarks |
| No Secure Boot chain | Boot process is not cryptographically verified |
| Updates require internet | Or manual intervention in an air-gapped scenario |
| Single-user focus | macOS is not designed for multi-tenant security |
Conclusion: A Mac Studio is unsuitable for environments with strict compliance requirements (ISO27001 Annex A, NIS2, SOC2). For homelab and development it’s acceptable.
LLM-Specific Risks
| Risk | Description |
|---|---|
| Non-determinism | Same input can produce different outputs |
| Prompt injection | Malicious pod names/labels can manipulate the LLM |
| Hallucinations | Model can suggest harmful remediation |
| Context leakage | Info from earlier queries can appear in responses |
| Supply chain | Model weights could be backdoored |
Threat Model
| Threat | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Prompt injection via pod metadata | Medium | High | Input sanitization, output validation |
| Hallucinated destructive commands | Medium | Critical | Human-in-the-loop, no auto-remediation |
| Model weights tampering | Low | Critical | Checksum verification, trusted source |
| Context window data leakage | Medium | Medium | Short context, no persistent memory |
| Unauthorized access to inference API | Medium | High | Network segmentation, auth |
| Resource exhaustion (DoS) | Low | Medium | Rate limiting, resource quotas |
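The first mitigation row, input sanitization, can be sketched in code: reduce pod metadata to safe characters and flag instruction-like content before it reaches the prompt. The patterns below are illustrative examples of injection phrasing, not a complete defense:

```python
import re

# Illustrative patterns for instruction-like text smuggled into names/labels.
SUSPICIOUS = re.compile(
    r"(ignore (all )?previous instructions|system prompt|you are now|</?sys)",
    re.IGNORECASE,
)
ALLOWED = re.compile(r"[^a-z0-9.\-]")  # valid DNS-1123 name characters

def sanitize_metadata(value: str, max_len: int = 253) -> str:
    """Reduce a metadata value to safe characters and flag injection attempts."""
    if SUSPICIOUS.search(value):
        return "[redacted: possible prompt injection]"
    return ALLOWED.sub("", value.lower())[:max_len]

print(sanitize_metadata("nginx-7d4b9"))  # → nginx-7d4b9
print(sanitize_metadata("pod; IGNORE PREVIOUS INSTRUCTIONS"))
```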
Conclusions and Recommendations
Is a Local LLM Usable for Kubernetes Diagnosis?
Yes, under certain conditions.
It can:
- Correctly identify standard issues
- Provide usable remediation suggestions
- Detect security problems
- Do all this without data leaving your network
It cannot:
- Debug complex, multi-component issues
- Reliably do auto-remediation
- Understand the context of your specific setup
- Guarantee correctness
Recommendations per Use Case
Homelab / Learning
Recommendation: Go for it.
This is an excellent way to learn about:
- LLM inference infrastructure
- Kubernetes troubleshooting patterns
- The limits of AI-assisted operations
Risks are acceptable because the impact is limited.
Development / Staging
Recommendation: Usable with guardrails.
Implement:
- Output review before applying suggestions
- Logging of all LLM interactions
- No auto-remediation, diagnosis only
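The logging guardrail can be as simple as an append-only JSONL audit trail around every LLM call. A sketch; `call_llm` is a placeholder for whatever client function you actually use:

```python
import hashlib
import json
import time
from pathlib import Path

def audit_llm_call(log_path: str, prompt: str, call_llm) -> str:
    """Invoke the LLM and append a record of the exchange to a JSONL log.

    `call_llm` stands in for your real client; the prompt hash makes it easy
    to spot replayed or tampered entries later.
    """
    response = call_llm(prompt)
    record = {
        "ts": time.time(),
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "prompt": prompt,
        "response": response,
    }
    with Path(log_path).open("a") as f:
        f.write(json.dumps(record) + "\n")
    return response
```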
Production (not air-gapped)
Recommendation: Use cloud APIs.
Why:
- Better models (GPT-4, Claude)
- Lower latency
- No hardware investment
- Professional SLAs
The privacy trade-off is acceptable for most organizations if you don’t have PII in cluster metadata.
Production (air-gapped / sovereign)
Recommendation: Only as last resort.
If you truly cannot send data outside:
- Consider smaller, dedicated models
- Implement defense-in-depth for the inference server
- Treat all LLM output as untrusted
- Ensure extensive logging and audit trails
- Use this as assistance, never as authority
The State of Autonomous Cluster Management
Let me be direct: “autonomous cluster management” with LLMs is currently marketing, not reality.
What we have is “assisted cluster management” — an AI that can help with diagnosis and make suggestions. But the human-in-the-loop is not optional. It’s a requirement.
The technology is impressive. A 70B model can produce surprisingly good analyses. But surprisingly good is not good enough for autonomous action on production infrastructure.
My advice: use these tools as a smart colleague you can consult. Not as a replacement for your own judgment.
Related posts:
- Sovereign Infrastructure — Why I self-host everything
- Why Privacy Matters — The context for local LLMs
