“Autonomous cluster management” — the promise that an AI can monitor your Kubernetes cluster, diagnose problems, and perhaps even fix them without human intervention. It sounds like the holy grail for platform engineers.

The reality is more nuanced.

In this post I test K8sGPT with a locally running Llama 3.3 70B model on Apple Silicon. No cloud APIs, no data leaving your network, fully sovereign. Is this usable for real cluster diagnosis? Let’s find out.

Disclaimer: This is a homelab experiment. I’m describing what I tested and what I found. This is not a recommendation to run this in production — quite the opposite, as the security analysis will show.

Hardware and Software Stack

The Hardware

  • Mac Studio M3 Ultra with 512GB unified memory
  • The M3 Ultra has 80 GPU cores you can use for inference
  • Unified memory means no copying between CPU and GPU RAM

This isn’t a cheap setup (~€10,000), but it’s the only consumer hardware that can run a 70B model in Q8 quantization at acceptable speeds.
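A quick sanity check on the memory requirement (the ~8.5 bits/weight for Q8_0 and the KV-cache overhead figure are my assumptions):

```python
# Back-of-the-envelope memory estimate for a 70B model at Q8_0 quantization.
params = 70e9
bits_per_weight = 8.5            # Q8_0: 8-bit weights plus per-block scale factors
weights_gb = params * bits_per_weight / 8 / 1e9
kv_and_overhead_gb = 4           # assumed: KV cache + runtime overhead at 8k context
total_gb = weights_gb + kv_and_overhead_gb
print(f"weights ≈ {weights_gb:.0f} GB, total ≈ {total_gb:.0f} GB")
```

That lands around 74GB of weights and roughly 78GB in total, which is why even a 128GB machine is marginal and 512GB gives comfortable headroom.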

The Software Stack

| Component | Version | Role |
|---|---|---|
| vLLM | 0.6.x | Inference server with Metal backend |
| Llama 3.3 70B | Q8_0 | The language model (~75GB) |
| K8sGPT Operator | 0.1.x | Kubernetes operator for diagnosis |
| k3s | 1.29.x | Local Kubernetes cluster |

Installation: vLLM with Metal Backend

vLLM has experimental Metal support for Apple Silicon. Installation:

# Create a dedicated conda environment
conda create -n vllm python=3.11
conda activate vllm

# Install vLLM with Metal support
pip install vllm

# Verify Metal backend
python -c "import vllm; print(vllm.__version__)"

Note: At the time of writing, Metal support in vLLM is still experimental. For production-like workloads, llama.cpp with its Metal backend is more mature, and its bundled llama-server also exposes an OpenAI-compatible API, so it's a workable alternative backend for K8sGPT.

Model Download

# Download the model (Q8 quantization, ~75GB)
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct \
  --local-dir ./models/llama-3.3-70b-instruct

You need a Hugging Face account and must accept the Llama license.

Starting the vLLM Server

# Start the inference server
vllm serve ./models/llama-3.3-70b-instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype float16 \
  --max-model-len 8192 \
  --device mps

The --device mps flag forces Metal Performance Shaders. Without this flag, vLLM falls back to CPU.

Verify the server is running:

curl http://localhost:8000/v1/models
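Beyond listing models, it's worth smoke-testing the chat endpoint itself, since that is what K8sGPT will actually call. This sketch only builds and prints the OpenAI-style request body; send it with curl or any HTTP client (the model name and host are from my setup):

```python
import json

# Minimal OpenAI-compatible chat completion request, the shape K8sGPT sends.
payload = {
    "model": "llama-3.3-70b-instruct",
    "messages": [
        {"role": "system", "content": "You are a Kubernetes troubleshooting assistant."},
        {"role": "user", "content": "Why would a pod be stuck in CreateContainerConfigError?"},
    ],
    "max_tokens": 256,
    "temperature": 0.2,
}
print(json.dumps(payload, indent=2))
# Send it with:
#   curl http://localhost:8000/v1/chat/completions \
#     -H 'Content-Type: application/json' -d @payload.json
```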

K8sGPT Operator Deployment

Install the K8sGPT operator in your cluster:

# Add the Helm repo
helm repo add k8sgpt https://charts.k8sgpt.ai/
helm repo update

# Install the operator
helm install k8sgpt-operator k8sgpt/k8sgpt-operator \
  -n k8sgpt-operator-system \
  --create-namespace

Configure a custom backend pointing to your local vLLM server:

# k8sgpt-config.yaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-local
  namespace: k8sgpt-operator-system
spec:
  ai:
    enabled: true
    model: llama-3.3-70b-instruct
    backend: localai
    baseUrl: http://192.168.1.100:8000/v1  # IP of your Mac Studio
  noCache: false
  version: v0.3.40
  analyzers:
    - Pod
    - Deployment
    - Service
    - ReplicaSet
    - PersistentVolumeClaim
    - Ingress
    - StatefulSet
    - CronJob

kubectl apply -f k8sgpt-config.yaml

Test Scenarios

Now that the setup is running, time to test whether it’s actually useful.

Scenario A: CrashLoopBackOff Diagnosis

I introduce a deployment with a missing ConfigMap:

# broken-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: broken-app
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: broken-app
  template:
    metadata:
      labels:
        app: broken-app
    spec:
      containers:
      - name: app
        image: nginx:1.25
        envFrom:
        - configMapRef:
            name: app-config  # This ConfigMap doesn't exist

kubectl apply -f broken-deployment.yaml

After a minute, the pod is stuck in CreateContainerConfigError (the container never starts, so it doesn't even get as far as CrashLoopBackOff). K8sGPT analysis:

kubectl get results -n k8sgpt-operator-system -o yaml

Output (paraphrased):

Analysis: Pod broken-app-xxxx is in CreateContainerConfigError state.
The pod is referencing a ConfigMap 'app-config' that does not exist
in the namespace.

Suggested remediation:
1. Create the missing ConfigMap:
   kubectl create configmap app-config --from-literal=KEY=value
2. Or remove the configMapRef from the deployment spec
3. Verify the ConfigMap name spelling matches exactly

Score: Usable

The diagnosis is correct and the suggestions are practical. A junior engineer could work with this. What's missing: it doesn't suggest checking whether the ConfigMap was accidentally created in a different namespace (envFrom can only reference ConfigMaps in the pod's own namespace), which is a common mistake.

Scenario B: Resource Recommendations

A deployment without resource limits:

# no-limits.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: no-limits-app
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: no-limits
  template:
    metadata:
      labels:
        app: no-limits
    spec:
      containers:
      - name: app
        image: nginx:1.25
        # No resources defined

K8sGPT analysis:

Analysis: Deployment no-limits-app has containers without resource
requests or limits defined.

Issues identified:
- Container 'app' has no CPU requests/limits
- Container 'app' has no memory requests/limits

Suggested remediation:
Add resource specifications to ensure predictable scheduling and
prevent resource starvation:

resources:
  requests:
    memory: "64Mi"
    cpu: "100m"
  limits:
    memory: "128Mi"
    cpu: "200m"

Score: Partially usable

It correctly identifies the problem, but the suggested values are generic and not based on actual consumption. An experienced engineer would first collect metrics with kubectl top pods or Prometheus data, then rightsize.

What’s missing:

  • No suggestion to use VPA (Vertical Pod Autoscaler)
  • No warning that limits without requests can be problematic
  • The requests:limits ratio (1:2) is arbitrary
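The metrics-first rightsizing step an engineer would do by hand can be sketched like this (a hypothetical helper, not K8sGPT behaviour; the p90 request and 2x limit headroom are assumptions):

```python
import math

def rightsize(memory_samples_mi, headroom=2.0):
    """Derive a memory request/limit from observed usage samples (MiB).

    Request = 90th percentile of observed usage, limit = request * headroom.
    """
    ordered = sorted(memory_samples_mi)
    idx = min(len(ordered) - 1, math.ceil(0.9 * len(ordered)) - 1)
    request = ordered[idx]
    return {"request_mi": request, "limit_mi": int(request * headroom)}

# Example: usage samples collected from `kubectl top pods` over a day
print(rightsize([48, 52, 55, 60, 61, 63, 70, 72, 75, 110]))
```

The point is that the numbers come from observed consumption, not from a template; any fixed request:limit ratio is still a policy choice you should make deliberately.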

Scenario C: Security Audit

A pod with poor security practices:

# insecure-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: insecure-pod
  namespace: default
spec:
  hostNetwork: true
  containers:
  - name: app
    image: nginx:1.25
    securityContext:
      privileged: true
      runAsUser: 0  # run as root
    volumeMounts:
    - name: host-root
      mountPath: /host
  volumes:
  - name: host-root
    hostPath:
      path: /

K8sGPT analysis with Trivy integration:

Analysis: Pod insecure-pod has multiple security issues:

Critical:
- Container running in privileged mode (privileged: true)
- Container has access to host network (hostNetwork: true)
- Container mounts host root filesystem

High:
- No securityContext.readOnlyRootFilesystem
- No securityContext.allowPrivilegeEscalation: false

Suggested remediation:
1. Remove privileged: true unless absolutely necessary
2. Use a dedicated network namespace (remove hostNetwork: true)
3. Avoid mounting host paths, especially root filesystem
4. Add security hardening:

   securityContext:
     privileged: false
     readOnlyRootFilesystem: true
     allowPrivilegeEscalation: false
     runAsNonRoot: true
     capabilities:
       drop:
         - ALL

Score: Usable

This is surprisingly good. The issues are correctly identified and the remediation is what a security engineer would advise. The Trivy integration adds value by also checking image vulnerabilities.

What’s missing:

  • No suggestion for Pod Security Standards (restricted profile)
  • No warning about Kyverno/OPA policies that should block this

Performance Metrics

Inference Speed

| Metric | Value |
|---|---|
| Tokens/second (prompt) | ~180 t/s |
| Tokens/second (generation) | ~25 t/s |
| First token latency | ~2.5s |
| Typical analysis (500 tokens out) | ~22s |

End-to-End Latency

From issue detection to report in K8sGPT:

| Phase | Time |
|---|---|
| Issue detection (polling) | 30s (configurable) |
| Context gathering | ~2s |
| LLM inference | ~20-30s |
| Result storage | <1s |
| Total | ~55s |
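The total can be reproduced from the inference figures (taking the worst-case full polling interval is my assumption):

```python
# End-to-end latency budget, built from the measured figures above.
polling_s = 30          # issue detection: worst case, a full polling interval
context_s = 2           # context gathering
first_token_s = 2.5
tokens_out = 500
gen_rate = 25           # tokens/second
inference_s = first_token_s + tokens_out / gen_rate   # time to generate the report
storage_s = 0.5
total_s = polling_s + context_s + inference_s + storage_s
print(f"inference ≈ {inference_s:.1f}s, total ≈ {total_s:.0f}s")
```

Note that the polling interval dominates: halving detection latency is a config change, while halving inference latency would need different hardware or a smaller model.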

Resource Usage

During inference:

| Resource | Usage |
|---|---|
| GPU Memory (Metal) | ~78GB |
| CPU | ~15% (data preprocessing) |
| System Memory | ~12GB (besides the model) |
| Power draw | ~180W |

Comparison with OpenAI API

| Metric | Local (70B) | OpenAI GPT-4 |
|---|---|---|
| Latency | ~25s | ~5s |
| Quality | Good | Very good |
| Cost | €0 (after hardware) | ~€0.03/query |
| Privacy | Fully local | Data to OpenAI |

The OpenAI API is faster and the output is marginally better, but your data leaves your network.
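To put the cost line in perspective, here is the break-even point implied by the table (electricity and maintenance ignored):

```python
# Break-even between a ~€10,000 Mac Studio and pay-per-query API pricing.
hardware_eur = 10_000
cost_per_query_eur = 0.03
break_even_queries = hardware_eur / cost_per_query_eur
print(f"break-even after ~{break_even_queries:,.0f} queries")
# At 100 analyses per day that is roughly nine years -- privacy,
# not cost, is the real argument for local inference.
```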

Air-Gapped Deployment

Can this setup work without internet connection? Yes, with preparation.

What You Need to Download Beforehand

# 1. Model weights (~75GB)
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct \
  --local-dir ./airgap-bundle/models/

# 2. vLLM Python packages
pip download vllm -d ./airgap-bundle/packages/

# 3. K8sGPT container images
docker pull ghcr.io/k8sgpt-ai/k8sgpt-operator:latest
docker save ghcr.io/k8sgpt-ai/k8sgpt-operator:latest > ./airgap-bundle/images/k8sgpt-operator.tar

# 4. Helm charts
helm pull k8sgpt/k8sgpt-operator --destination ./airgap-bundle/charts/

Transport and Installation

# On the air-gapped machine:

# 1. Install Python packages offline
pip install --no-index --find-links=./airgap-bundle/packages/ vllm

# 2. Load container images
docker load < ./airgap-bundle/images/k8sgpt-operator.tar
# Or push to your local registry

# 3. Install Helm chart
helm install k8sgpt-operator ./airgap-bundle/charts/k8sgpt-operator-*.tgz \
  --set image.repository=your-local-registry/k8sgpt-operator

Air-Gap Friendly Components

| Component | Air-Gap Ready | Notes |
|---|---|---|
| vLLM | Yes | No phone-home |
| Llama model | Yes | One-time download |
| K8sGPT Operator | Yes | No telemetry |
| Trivy DB | No | Requires periodic updates |

Note: The Trivy vulnerability database needs to be updated and transported separately. Without a recent DB, K8sGPT will miss new CVEs.

Security Analysis and Threat Model

This is where it gets interesting. Let’s be honest about the risks.

Platform Security Issues

A Mac Studio as inference server has fundamental limitations:

| Issue | Impact |
|---|---|
| No TPM | No hardware attestation, no measured boot |
| macOS is general-purpose | Not hardened like RHEL/Ubuntu with CIS benchmarks |
| No Secure Boot chain | Boot process is not cryptographically verified |
| Updates require internet | Or manual intervention in air-gapped scenario |
| Single-user focus | macOS is not designed for multi-tenant security |

Conclusion: A Mac Studio is unsuitable for environments with strict compliance requirements (ISO27001 Annex A, NIS2, SOC2). For homelab and development it’s acceptable.

LLM-Specific Risks

| Risk | Description |
|---|---|
| Non-determinism | Same input can produce different outputs |
| Prompt injection | Malicious pod names/labels can manipulate the LLM |
| Hallucinations | Model can suggest harmful remediation |
| Context leakage | Info from earlier queries can appear in responses |
| Supply chain | Model weights could be backdoored |

Threat Model

| Threat | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Prompt injection via pod metadata | Medium | High | Input sanitization, output validation |
| Hallucinated destructive commands | Medium | Critical | Human-in-the-loop, no auto-remediation |
| Model weights tampering | Low | Critical | Checksum verification, trusted source |
| Context window data leakage | Medium | Medium | Short context, no persistent memory |
| Unauthorized access to inference API | Medium | High | Network segmentation, auth |
| Resource exhaustion (DoS) | Low | Medium | Rate limiting, resource quotas |
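The "input sanitization" mitigation can start very simply: strip anything from pod metadata that looks like an instruction before it reaches the prompt. A hypothetical helper (the deny-list pattern is an assumption and deliberately incomplete):

```python
import re

# Naive deny-list of instruction-like fragments an attacker might hide in
# pod names, labels, or annotations. Real sanitization needs far more.
SUSPICIOUS = re.compile(
    r"(ignore (all )?previous instructions|system prompt|you are now|disregard)",
    re.IGNORECASE,
)

def sanitize_metadata(value, max_len=128):
    """Drop non-printable characters, truncate, and flag injection attempts."""
    cleaned = "".join(ch for ch in value if ch.isprintable())[:max_len]
    if SUSPICIOUS.search(cleaned):
        return "[REDACTED: possible prompt injection]"
    return cleaned

print(sanitize_metadata("payments-api-7f9c"))
```

A deny-list like this will never be complete, which is exactly why output validation and a human in the loop appear as separate mitigations in the table.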

Conclusions and Recommendations

Is a Local LLM Usable for Kubernetes Diagnosis?

Yes, under conditions.

It can:

  • Correctly identify standard issues
  • Provide usable remediation suggestions
  • Detect security problems
  • Do all this without data leaving your network

It cannot:

  • Debug complex, multi-component issues
  • Reliably do auto-remediation
  • Understand the context of your specific setup
  • Guarantee correctness

Recommendations per Use Case

Homelab / Learning

Recommendation: Go for it.

This is an excellent way to learn about:

  • LLM inference infrastructure
  • Kubernetes troubleshooting patterns
  • The limits of AI-assisted operations

Risks are acceptable because the impact is limited.

Development / Staging

Recommendation: Usable with guardrails.

Implement:

  • Output review before applying suggestions
  • Logging of all LLM interactions
  • No auto-remediation, diagnosis only
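"Output review before applying suggestions" can be enforced mechanically: anything the model proposes that would mutate the cluster goes to a human first. A sketch of such a gate (hypothetical helper; the verb list is an assumption):

```python
# kubectl verbs that mutate cluster state -- everything else is read-only.
MUTATING_VERBS = {"apply", "create", "delete", "patch", "edit", "scale",
                  "drain", "cordon", "replace", "label", "annotate"}

def requires_approval(suggested_command):
    """Return True if an LLM-suggested kubectl command needs human review."""
    parts = suggested_command.split()
    return len(parts) > 1 and parts[0] == "kubectl" and parts[1] in MUTATING_VERBS

print(requires_approval("kubectl get pods -n default"))       # read-only
print(requires_approval("kubectl delete pod broken-app-xxx")) # mutating
```

Parsing shell strings is fragile, so in practice you would pair a gate like this with RBAC that denies the diagnosis service account all write verbs outright.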

Production (not air-gapped)

Recommendation: Use cloud APIs.

Why:

  • Better models (GPT-4, Claude)
  • Lower latency
  • No hardware investment
  • Professional SLAs

The privacy trade-off is acceptable for most organizations if you don’t have PII in cluster metadata.

Production (air-gapped / sovereign)

Recommendation: Only as last resort.

If you truly cannot send data outside:

  • Consider smaller, dedicated models
  • Implement defense-in-depth for the inference server
  • Treat all LLM output as untrusted
  • Ensure extensive logging and audit trails
  • Use this as assistance, never as authority

The State of Autonomous Cluster Management

Let me be direct: “autonomous cluster management” with LLMs is currently marketing, not reality.

What we have is “assisted cluster management” — an AI that can help with diagnosis and make suggestions. But the human-in-the-loop is not optional. It’s a requirement.

The technology is impressive. A 70B model can produce surprisingly good analyses. But surprisingly good is not good enough for autonomous action on production infrastructure.

My advice: use these tools as a smart colleague you can consult. Not as a replacement for your own judgment.
