“Autonomous cluster management” — the promise that an AI can monitor your Kubernetes cluster, diagnose problems, and perhaps even fix them without human intervention. It sounds like the holy grail for platform engineers.
The reality is more nuanced.
In this post I test K8sGPT with a locally running Llama 3.3 70B model on Apple Silicon. No cloud APIs, no data leaving your network, fully sovereign. Is this usable for real cluster diagnosis? Let’s find out.
Disclaimer: This is a homelab experiment. I’m describing what I tested and what I found. This is not a recommendation to run this in production — quite the opposite, as the security analysis will show.
Hardware and Software Stack
The Hardware
- Mac Studio M3 Ultra with 512GB unified memory
- 80 GPU cores, all available for inference
- Unified memory means no copying between CPU and GPU RAM
This isn’t a cheap setup (~€10,000), but few consumer machines can run a 70B model at Q8 quantization with acceptable speed.
The Software Stack
| Component | Version | Role |
|---|---|---|
| vLLM | 0.6.x | Inference server with Metal backend |
| Llama 3.3 70B | Q8_0 | The language model (~75GB) |
| K8sGPT Operator | 0.1.x | Kubernetes operator for diagnosis |
| k3s | 1.29.x | Local Kubernetes cluster |
Installation: vLLM with Metal Backend
vLLM has experimental Metal support for Apple Silicon. Installation:
# Create a dedicated conda environment
conda create -n vllm python=3.11
conda activate vllm
# Install vLLM with Metal support
pip install vllm
# Verify Metal backend
python -c "import vllm; print(vllm.__version__)"
Note: At the time of writing, Metal support in vLLM is still experimental. For production-like workloads, llama.cpp with its Metal backend is more stable, and its built-in server also exposes the OpenAI-compatible API that K8sGPT expects.
Model Download
# Download the model (Q8 quantization, ~75GB)
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct \
--local-dir ./models/llama-3.3-70b-instruct
You need a Hugging Face account and must accept the Llama license.
Starting the vLLM Server
# Start the inference server
vllm serve ./models/llama-3.3-70b-instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype float16 \
--max-model-len 8192 \
--device mps
The --device mps flag forces Metal Performance Shaders. Without this flag, vLLM falls back to CPU.
Verify the server is running:
curl http://localhost:8000/v1/models
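Beyond listing models, you can fire a test chat completion at the OpenAI-compatible endpoint. A minimal stdlib-only sketch (the base URL and model name match the setup above; the actual send is commented out so the script also runs without a live server):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # keep smoke tests as deterministic as possible
    }

def send_chat_request(base_url: str, body: dict) -> dict:
    """POST the request to the inference server and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    body = build_chat_request("llama-3.3-70b-instruct", "Say 'pong'.")
    print(json.dumps(body, indent=2))
    # With the vLLM server from the previous step running:
    # print(send_chat_request("http://localhost:8000/v1", body))
```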
K8sGPT Operator Deployment
Install the K8sGPT operator in your cluster:
# Add the Helm repo
helm repo add k8sgpt https://charts.k8sgpt.ai/
helm repo update
# Install the operator
helm install k8sgpt-operator k8sgpt/k8sgpt-operator \
-n k8sgpt-operator-system \
--create-namespace
Configure a custom backend pointing to your local vLLM server:
# k8sgpt-config.yaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
name: k8sgpt-local
namespace: k8sgpt-operator-system
spec:
ai:
enabled: true
model: llama-3.3-70b-instruct
backend: localai
baseUrl: http://192.168.1.100:8000/v1 # IP of your Mac Studio
noCache: false
version: v0.3.40
analyzers:
- Pod
- Deployment
- Service
- ReplicaSet
- PersistentVolumeClaim
- Ingress
- StatefulSet
- CronJob
kubectl apply -f k8sgpt-config.yaml
Test Scenarios
Now that the setup is running, time to test whether it’s actually useful.
Scenario A: Failing Pod Diagnosis
I introduce a deployment with a missing ConfigMap:
# broken-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: broken-app
namespace: default
spec:
replicas: 1
selector:
matchLabels:
app: broken-app
template:
metadata:
labels:
app: broken-app
spec:
containers:
- name: app
image: nginx:1.25
envFrom:
- configMapRef:
name: app-config # This ConfigMap doesn't exist
kubectl apply -f broken-deployment.yaml
After a minute, the pod is stuck in CreateContainerConfigError (the container never starts, so it never actually reaches CrashLoopBackOff). K8sGPT analysis:
kubectl get results -n k8sgpt-operator-system -o yaml
Output (paraphrased):
Analysis: Pod broken-app-xxxx is in CreateContainerConfigError state.
The pod is referencing a ConfigMap 'app-config' that does not exist
in the namespace.
Suggested remediation:
1. Create the missing ConfigMap:
kubectl create configmap app-config --from-literal=KEY=value
2. Or remove the configMapRef from the deployment spec
3. Verify the ConfigMap name spelling matches exactly
Score: Usable
The diagnosis is correct and the suggestions are practical. A junior engineer could work with this. What’s missing: it doesn’t suggest first checking if the ConfigMap might be in a different namespace, which is a common mistake.
Scenario B: Resource Recommendations
A deployment without resource limits:
# no-limits.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: no-limits-app
namespace: default
spec:
replicas: 3
selector:
matchLabels:
app: no-limits
template:
metadata:
labels:
app: no-limits
spec:
containers:
- name: app
image: nginx:1.25
# No resources defined
K8sGPT analysis:
Analysis: Deployment no-limits-app has containers without resource
requests or limits defined.
Issues identified:
- Container 'app' has no CPU requests/limits
- Container 'app' has no memory requests/limits
Suggested remediation:
Add resource specifications to ensure predictable scheduling and
prevent resource starvation:
resources:
requests:
memory: "64Mi"
cpu: "100m"
limits:
memory: "128Mi"
cpu: "200m"
Score: Partially usable
It correctly identifies the problem, but the suggested values are generic and not based on actual consumption. An experienced engineer would first collect metrics with kubectl top pods or Prometheus data, then rightsize.
What’s missing:
- No suggestion to use VPA (Vertical Pod Autoscaler)
- No warning that limits without requests can be problematic
- The requests:limits ratio (1:2) is arbitrary
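The rightsizing step an engineer would do by hand can be sketched in a few lines: take observed usage from kubectl top or Prometheus, pick a high percentile, add headroom. The p95 percentile, 20% headroom, and 1:1.5 request:limit ratio here are my own illustrative choices, not anything K8sGPT suggests:

```python
def rightsize(samples_mib: list[float], headroom: float = 1.2,
              limit_ratio: float = 1.5) -> dict:
    """Derive a memory request/limit from observed usage samples (MiB).

    Request = p95 of observed usage plus headroom; limit = request * ratio.
    The defaults are illustrative, tune them per workload.
    """
    ordered = sorted(samples_mib)
    p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
    request = int(p95 * headroom)
    return {"request_mib": request, "limit_mib": int(request * limit_ratio)}

# Example with five hypothetical kubectl-top samples:
print(rightsize([48, 52, 55, 60, 90]))  # → {'request_mib': 108, 'limit_mib': 162}
```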
Scenario C: Security Audit
A pod with poor security practices:
# insecure-pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: insecure-pod
namespace: default
spec:
hostNetwork: true
containers:
- name: app
image: nginx:1.25
securityContext:
privileged: true
runAsUser: 0  # run as root (there is no runAsRoot field in securityContext)
volumeMounts:
- name: host-root
mountPath: /host
volumes:
- name: host-root
hostPath:
path: /
K8sGPT analysis with Trivy integration:
Analysis: Pod insecure-pod has multiple security issues:
Critical:
- Container running in privileged mode (privileged: true)
- Container has access to host network (hostNetwork: true)
- Container mounts host root filesystem
High:
- No securityContext.readOnlyRootFilesystem
- No securityContext.allowPrivilegeEscalation: false
Suggested remediation:
1. Remove privileged: true unless absolutely necessary
2. Use a dedicated network namespace (remove hostNetwork: true)
3. Avoid mounting host paths, especially root filesystem
4. Add security hardening:
securityContext:
privileged: false
readOnlyRootFilesystem: true
allowPrivilegeEscalation: false
runAsNonRoot: true
capabilities:
drop:
- ALL
Score: Usable
This is surprisingly good. The issues are correctly identified and the remediation is what a security engineer would advise. The Trivy integration adds value by also checking image vulnerabilities.
What’s missing:
- No suggestion for Pod Security Standards (restricted profile)
- No warning about Kyverno/OPA policies that should block this
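For completeness: a Kyverno ClusterPolicy along these lines would have rejected the pod at admission, before any LLM ever saw it. This is a sketch of the well-known disallow-privileged pattern, validate it against your Kyverno version:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
spec:
  validationFailureAction: Enforce
  rules:
  - name: deny-privileged-containers
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: "Privileged containers and hostNetwork are not allowed."
      pattern:
        spec:
          =(hostNetwork): false
          containers:
          - =(securityContext):
              =(privileged): false
```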
Performance Metrics
Inference Speed
| Metric | Value |
|---|---|
| Tokens/second (prompt) | ~180 t/s |
| Tokens/second (generation) | ~25 t/s |
| First token latency | ~2.5s |
| Typical analysis (500 tokens out) | ~22s |
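The typical-analysis row follows directly from the other rows: wait for the first token, then stream the remaining output at the generation rate:

```python
def analysis_latency_s(first_token_s: float, out_tokens: int, gen_tps: float) -> float:
    """End-to-end generation time: first-token latency plus streaming time."""
    return first_token_s + out_tokens / gen_tps

# Values from the table above:
print(round(analysis_latency_s(2.5, 500, 25.0), 1))  # → 22.5
```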
End-to-End Latency
From issue detection to report in K8sGPT:
| Phase | Time |
|---|---|
| Issue detection (polling) | 30s (configurable) |
| Context gathering | ~2s |
| LLM inference | ~20-30s |
| Result storage | <1s |
| Total | ~55s |
Resource Usage
During inference:
| Resource | Usage |
|---|---|
| GPU Memory (Metal) | ~78GB |
| CPU | ~15% (data preprocessing) |
| System Memory | ~12GB (excluding the model) |
| Power draw | ~180W |
Comparison with OpenAI API
| Metric | Local (70B) | OpenAI GPT-4 |
|---|---|---|
| Latency | ~25s | ~5s |
| Quality | Good | Very good |
| Cost | €0 (after hardware) | ~€0.03/query |
| Privacy | Fully local | Data to OpenAI |
The OpenAI API is faster and the output is marginally better, but your data leaves your network.
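A rough break-even calculation for the cost row (ignoring power draw and per-token pricing details; €10,000 of hardware against ~€0.03 per API query):

```python
def breakeven_queries(hardware_eur: float, per_query_eur: float) -> int:
    """Number of API queries after which local hardware has paid for itself."""
    return round(hardware_eur / per_query_eur)

print(breakeven_queries(10_000, 0.03))  # → 333333
```

At a few dozen analyses per day, that is decades of usage, so the hardware only pays off if it serves other workloads too.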
Air-Gapped Deployment
Can this setup work without an internet connection? Yes, with preparation.
What You Need to Download Beforehand
# 1. Model weights (~75GB)
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct \
--local-dir ./airgap-bundle/models/
# 2. vLLM Python packages
pip download vllm -d ./airgap-bundle/packages/
# 3. K8sGPT container images
docker pull ghcr.io/k8sgpt-ai/k8sgpt-operator:latest
docker save ghcr.io/k8sgpt-ai/k8sgpt-operator:latest > ./airgap-bundle/images/k8sgpt-operator.tar
# 4. Helm charts
helm pull k8sgpt/k8sgpt-operator --destination ./airgap-bundle/charts/
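Before transporting the bundle, record SHA-256 checksums so tampering in transit is detectable on the air-gapped side. A sketch, assuming the bundle layout from the commands above:

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 (model weights are too big to read at once)."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def write_manifest(bundle_dir: str, manifest_name: str = "SHA256SUMS") -> Path:
    """Write '<digest>  <relative path>' lines for every file in the bundle."""
    bundle = Path(bundle_dir)
    lines = [
        f"{sha256_file(p)}  {p.relative_to(bundle)}"
        for p in sorted(bundle.rglob("*"))
        if p.is_file() and p.name != manifest_name
    ]
    manifest = bundle / manifest_name
    manifest.write_text("\n".join(lines) + "\n")
    return manifest

# Verify on the other side with: shasum -a 256 -c SHA256SUMS (macOS)
# or: sha256sum -c SHA256SUMS (Linux)
```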
Transport and Installation
# On the air-gapped machine:
# 1. Install Python packages offline
pip install --no-index --find-links=./airgap-bundle/packages/ vllm
# 2. Load container images
docker load < ./airgap-bundle/images/k8sgpt-operator.tar
# Or push to your local registry
# 3. Install Helm chart
helm install k8sgpt-operator ./airgap-bundle/charts/k8sgpt-operator-*.tgz \
--set image.repository=your-local-registry/k8sgpt-operator
Air-Gap Friendly Components
| Component | Air-Gap Ready | Notes |
|---|---|---|
| vLLM | Yes | No phone-home |
| Llama model | Yes | One-time download |
| K8sGPT Operator | Yes | No telemetry |
| Trivy DB | No | Requires periodic updates |
Note: The Trivy vulnerability database needs to be updated and transported separately. Without a recent DB, K8sGPT will miss new CVEs.
Security Analysis and Threat Model
This is where it gets interesting. Let’s be honest about the risks.
Platform Security Issues
A Mac Studio as inference server has fundamental limitations:
| Issue | Impact |
|---|---|
| No TPM | No hardware attestation, no measured boot |
| macOS is general-purpose | Not hardened like RHEL/Ubuntu with CIS benchmarks |
| No Secure Boot chain | Boot process is not cryptographically verified |
| Updates require internet | Or manual intervention in an air-gapped scenario |
| Single-user focus | macOS is not designed for multi-tenant security |
Conclusion: A Mac Studio is unsuitable for environments with strict compliance requirements (ISO27001 Annex A, NIS2, SOC2). For homelab and development it’s acceptable.
LLM-Specific Risks
| Risk | Description |
|---|---|
| Non-determinism | Same input can produce different outputs |
| Prompt injection | Malicious pod names/labels can manipulate the LLM |
| Hallucinations | Model can suggest harmful remediation |
| Context leakage | Info from earlier queries can appear in responses |
| Supply chain | Model weights could be backdoored |
Threat Model
| Threat | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Prompt injection via pod metadata | Medium | High | Input sanitization, output validation |
| Hallucinated destructive commands | Medium | Critical | Human-in-the-loop, no auto-remediation |
| Model weights tampering | Low | Critical | Checksum verification, trusted source |
| Context window data leakage | Medium | Medium | Short context, no persistent memory |
| Unauthorized access to inference API | Medium | High | Network segmentation, auth |
| Resource exhaustion (DoS) | Low | Medium | Rate limiting, resource quotas |
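The first mitigation row, input sanitization, can be sketched in code: reduce pod metadata to safe characters and flag instruction-like content before it reaches the prompt. The patterns below are illustrative examples of injection phrasing, not a complete defense:

```python
import re

# Illustrative patterns for instruction-like text smuggled into names/labels.
SUSPICIOUS = re.compile(
    r"(ignore (all )?previous instructions|system prompt|you are now|</?sys)",
    re.IGNORECASE,
)
ALLOWED = re.compile(r"[^a-z0-9.\-]")  # valid DNS-1123 name characters

def sanitize_metadata(value: str, max_len: int = 253) -> str:
    """Reduce a metadata value to safe characters and flag injection attempts."""
    if SUSPICIOUS.search(value):
        return "[redacted: possible prompt injection]"
    return ALLOWED.sub("", value.lower())[:max_len]

print(sanitize_metadata("nginx-7d4b9"))  # → nginx-7d4b9
print(sanitize_metadata("pod; IGNORE PREVIOUS INSTRUCTIONS"))
```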
Conclusions and Recommendations
Is a Local LLM Usable for Kubernetes Diagnosis?
Yes, under certain conditions.
It can:
- Correctly identify standard issues
- Provide usable remediation suggestions
- Detect security problems
- Do all this without data leaving your network
It cannot:
- Debug complex, multi-component issues
- Reliably do auto-remediation
- Understand the context of your specific setup
- Guarantee correctness
Recommendations per Use Case
Homelab / Learning
Recommendation: Go for it.
This is an excellent way to learn about:
- LLM inference infrastructure
- Kubernetes troubleshooting patterns
- The limits of AI-assisted operations
Risks are acceptable because the impact is limited.
Development / Staging
Recommendation: Usable with guardrails.
Implement:
- Output review before applying suggestions
- Logging of all LLM interactions
- No auto-remediation, diagnosis only
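The logging guardrail can be as simple as an append-only JSONL audit trail around every LLM call. A sketch; `call_llm` is a placeholder for whatever client function you actually use:

```python
import hashlib
import json
import time
from pathlib import Path

def audit_llm_call(log_path: str, prompt: str, call_llm) -> str:
    """Invoke the LLM and append a record of the exchange to a JSONL log.

    `call_llm` stands in for your real client; the prompt hash makes it easy
    to spot replayed or tampered entries later.
    """
    response = call_llm(prompt)
    record = {
        "ts": time.time(),
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "prompt": prompt,
        "response": response,
    }
    with Path(log_path).open("a") as f:
        f.write(json.dumps(record) + "\n")
    return response
```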
Production (not air-gapped)
Recommendation: Use cloud APIs.
Why:
- Better models (GPT-4, Claude)
- Lower latency
- No hardware investment
- Professional SLAs
The privacy trade-off is acceptable for most organizations if you don’t have PII in cluster metadata.
Production (air-gapped / sovereign)
Recommendation: Only as last resort.
If you truly cannot send data outside:
- Consider smaller, dedicated models
- Implement defense-in-depth for the inference server
- Treat all LLM output as untrusted
- Ensure extensive logging and audit trails
- Use this as assistance, never as authority
The State of Autonomous Cluster Management
Let me be direct: “autonomous cluster management” with LLMs is currently marketing, not reality.
What we have is “assisted cluster management” — an AI that can help with diagnosis and make suggestions. But the human-in-the-loop is not optional. It’s a requirement.
The technology is impressive. A 70B model can produce surprisingly good analyses. But surprisingly good is not good enough for autonomous action on production infrastructure.
My advice: use these tools as a smart colleague you can consult. Not as a replacement for your own judgment.
Related posts:
- Sovereign Infrastructure — Why I self-host everything
- Why Privacy Matters — The context for local LLMs
