In my previous post on Prometheus and Thanos, I covered the sidecar architecture — Thanos Sidecar runs alongside Prometheus, uploads TSDB blocks to object storage, and exposes data to the Querier. It works beautifully for clusters with stable connectivity to your central infrastructure.
But what happens when your clusters are at the edge? When they might lose connectivity for hours or days? When you’re running dozens or hundreds of small clusters and don’t want sidecar complexity on each one?
That’s where Thanos Receive comes in. Push instead of pull.
## The Problem with Sidecars at Scale
The sidecar model requires:
- Thanos Sidecar running on every Prometheus instance
- Direct object storage access from every cluster
- Stable network connectivity for block uploads
- gRPC connectivity to the central Querier for real-time queries
For a handful of clusters in the same datacenter, this is fine. But consider:
- Edge locations with intermittent internet connectivity
- Air-gapped environments that can only push outbound
- Hundreds of small clusters where sidecar overhead adds up
- Multi-tenant platforms where you don’t control all clusters
The sidecar model breaks down. You need something simpler.
## Enter Thanos Receive
Thanos Receive flips the model. Instead of sidecars pulling data out of Prometheus, Prometheus pushes metrics directly to a central Receive component.
```mermaid
flowchart TD
    subgraph edge["Edge Clusters"]
        subgraph E1["Site A (Factory)"]
            P1["Prometheus"]
        end
        subgraph E2["Site B (Warehouse)"]
            P2["Prometheus"]
        end
        subgraph E3["Site C (Retail)"]
            P3["Prometheus"]
        end
    end

    subgraph central["Central Cluster"]
        R["Thanos Receive<br/>(HA cluster)"]
        Q["Thanos Querier"]
        SG["Store Gateway"]
        C["Compactor"]
        OS["Object Storage"]
    end

    P1 -->|"remote_write"| R
    P2 -->|"remote_write"| R
    P3 -->|"remote_write"| R
    R --> OS
    OS --> SG
    SG --> Q
    R --> Q
    OS --> C
    Q --> G["Grafana"]
```
The key insight: Prometheus already has remote_write. It’s a built-in feature that pushes samples to any compatible endpoint. Thanos Receive implements this API.
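The HTTP contract is simple: Receive exposes an endpoint (here `/api/v1/receive`) that accepts snappy-compressed protobuf `WriteRequest` payloads over POST and answers with a 2xx on success. Here is a stdlib-only Python sketch of that contract; it acknowledges requests without decoding them (Thanos Receive, of course, does the real parsing):

```python
# Illustration only: the HTTP shape of a remote_write-compatible endpoint.
# A real receiver snappy-decompresses the body and parses the Prometheus
# WriteRequest protobuf; this sketch just records and acknowledges requests.
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

received = []  # raw request bodies, for inspection

class ReceiveHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/api/v1/receive":
            self.send_response(404)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        received.append(self.rfile.read(length))
        # Prometheus treats 2xx as success; 5xx triggers retry with backoff.
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

def serve(port: int = 19291) -> HTTPServer:
    server = HTTPServer(("127.0.0.1", port), ReceiveHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Because the sender side is built into Prometheus, anything that speaks this protocol can sit on the receiving end, which is exactly the seam Thanos Receive exploits.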
## Why Push is Better for Edge
### 1. Simpler Edge Deployment
No sidecar. No object storage credentials at the edge. No gRPC ports to expose. Just Prometheus with a remote_write URL.
```yaml
# The ENTIRE edge configuration
global:
  external_labels:
    cluster: factory-site-a
    region: europe-west

remote_write:
  - url: https://thanos-receive.central.example.com/api/v1/receive
    bearer_token_file: /etc/prometheus/token
```
That’s it. The edge cluster doesn’t need to know about object storage, Thanos components, or anything else. It just pushes metrics.
### 2. Tolerates Connectivity Loss
When connectivity drops, Prometheus buffers samples locally. When connectivity returns, it catches up. The Write-Ahead Log (WAL) handles this automatically.
```yaml
remote_write:
  - url: https://thanos-receive.central.example.com/api/v1/receive
    queue_config:
      capacity: 100000          # Buffer size
      max_shards: 50            # Parallel send workers
      min_shards: 1
      max_samples_per_send: 5000
      batch_send_deadline: 30s
      min_backoff: 1s
      max_backoff: 5m           # Exponential backoff on failure
```
I’ve seen edge clusters lose connectivity for 4+ hours and successfully catch up when the network returned. The WAL kept everything safe.
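The `min_backoff`/`max_backoff` settings describe a doubling retry schedule, which can be sketched as follows (this is just the shape, not Prometheus's exact implementation):

```python
def backoff_schedule(min_backoff: float = 1.0, max_backoff: float = 300.0,
                     attempts: int = 10) -> list[float]:
    """Delays (seconds) between successive failed sends: doubles from
    min_backoff until capped at max_backoff (the 1s and 5m above)."""
    delays, delay = [], min_backoff
    for _ in range(attempts):
        delays.append(delay)
        delay = min(delay * 2, max_backoff)
    return delays

# Early retries come quickly; a sustained outage settles at the cap,
# so a down endpoint is only probed every few minutes:
# backoff_schedule() -> [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 300.0]
```

The cap is what keeps a recovering Receive cluster from being hammered by every edge site at once.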
### 3. Outbound-Only Connectivity
Edge sites often have restrictive firewalls. Inbound connections? Blocked. VPNs? Nightmare to maintain. But outbound HTTPS to a known endpoint? Usually allowed.
Push architecture aligns with how edge networks actually work.
### 4. Lower Resource Usage at Edge
No sidecar means:
- Less memory on edge nodes
- No additional container to manage
- No gRPC connections to maintain
- Simpler failure modes
For resource-constrained edge devices, this matters.
## Setting Up Thanos Receive
### Central Cluster: Receive Deployment
Thanos Receive is stateful — it writes to disk before uploading to object storage. Run it as a StatefulSet:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-receive
  namespace: monitoring
spec:
  serviceName: thanos-receive
  replicas: 3
  selector:
    matchLabels:
      app: thanos-receive
  template:
    metadata:
      labels:
        app: thanos-receive
    spec:
      containers:
        - name: thanos-receive
          image: quay.io/thanos/thanos:v0.34.0
          args:
            - receive
            - --http-address=0.0.0.0:10902
            - --grpc-address=0.0.0.0:10901
            - --remote-write.address=0.0.0.0:19291
            - --tsdb.path=/var/thanos/receive
            - --objstore.config-file=/etc/thanos/objstore.yaml
            - --label=receive_replica="$(POD_NAME)"
            - --tsdb.retention=6h
            # Hashring for HA distribution
            - --receive.hashrings-file=/etc/thanos/hashring.json
            - --receive.local-endpoint=$(POD_NAME).thanos-receive.monitoring.svc:10901
          ports:
            - name: http
              containerPort: 10902
            - name: grpc
              containerPort: 10901
            - name: remote-write
              containerPort: 19291
          volumeMounts:
            - name: data
              mountPath: /var/thanos/receive
            - name: objstore-config
              mountPath: /etc/thanos
            # Mount the hashring ConfigMap (defined below) as a single file,
            # so it doesn't collide with the objstore secret at /etc/thanos
            - name: hashring-config
              mountPath: /etc/thanos/hashring.json
              subPath: hashring.json
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
      volumes:
        - name: objstore-config
          secret:
            secretName: thanos-objstore-secret
        - name: hashring-config
          configMap:
            name: thanos-receive-hashring
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        storageClassName: longhorn
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
```
### Hashring for High Availability
When running multiple Receive replicas, you need a hashring to distribute incoming data:
```json
[
  {
    "endpoints": [
      "thanos-receive-0.thanos-receive.monitoring.svc:10901",
      "thanos-receive-1.thanos-receive.monitoring.svc:10901",
      "thanos-receive-2.thanos-receive.monitoring.svc:10901"
    ]
  }
]
```
Store this in a ConfigMap:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-receive-hashring
  namespace: monitoring
data:
  hashring.json: |
    [
      {
        "endpoints": [
          "thanos-receive-0.thanos-receive.monitoring.svc:10901",
          "thanos-receive-1.thanos-receive.monitoring.svc:10901",
          "thanos-receive-2.thanos-receive.monitoring.svc:10901"
        ]
      }
    ]
```
The hashring ensures samples from the same time series always go to the same Receive instance, preventing duplicates.
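To make the "same series, same instance" property concrete, here is a stdlib-only sketch of hashmod-style routing. Thanos's default hashring hashes the tenant plus the series labelset with xxhash; sha256 stands in here so the example runs without dependencies, and the behavior (deterministic assignment) is what matters:

```python
import hashlib

ENDPOINTS = [
    "thanos-receive-0.thanos-receive.monitoring.svc:10901",
    "thanos-receive-1.thanos-receive.monitoring.svc:10901",
    "thanos-receive-2.thanos-receive.monitoring.svc:10901",
]

def route(tenant: str, labels: dict, endpoints: list = ENDPOINTS) -> str:
    # Hash the tenant plus the sorted labelset so an identical series
    # always lands on the same replica, regardless of which node first
    # accepts the request and forwards it around the ring.
    key = tenant + "".join(f"{k}={v};" for k, v in sorted(labels.items()))
    digest = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return endpoints[digest % len(endpoints)]
```

Note that resizing the endpoint list remaps most series under plain hashmod, which is why changing replica counts causes a brief ingestion wobble.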
### Load Balancer Service
Expose Receive for external clusters, either directly through a LoadBalancer or behind a TLS-terminating Ingress:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: thanos-receive-lb
  namespace: monitoring
spec:
  type: LoadBalancer
  ports:
    - name: remote-write
      port: 443
      targetPort: 19291
  selector:
    app: thanos-receive
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: thanos-receive
  namespace: monitoring
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
spec:
  tls:
    - hosts:
        - thanos-receive.example.com
      secretName: thanos-receive-tls
  rules:
    - host: thanos-receive.example.com
      http:
        paths:
          - path: /api/v1/receive
            pathType: Prefix
            backend:
              service:
                name: thanos-receive
                port:
                  number: 19291
```
### Edge Cluster: Prometheus Configuration
On each edge cluster, configure Prometheus to push metrics:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  retention: 24h  # Local retention for edge queries
  externalLabels:
    cluster: factory-site-a
    region: europe-west
    environment: production
  remoteWrite:
    - url: https://thanos-receive.example.com/api/v1/receive
      bearerTokenSecret:
        name: thanos-remote-write-token
        key: token
      queueConfig:
        capacity: 50000
        maxShards: 20
        minShards: 1
        maxSamplesPerSend: 2000
        batchSendDeadline: 30s
        minBackoff: 1s
        maxBackoff: 5m
      writeRelabelConfigs:
        # Drop high-cardinality metrics you don't need centrally
        - sourceLabels: [__name__]
          regex: 'go_.*'
          action: drop
```
**Important:** Set meaningful `externalLabels`. These identify which cluster the metrics came from. Without them, you can’t distinguish data from different sites.
## Authentication and Security
Never expose Thanos Receive without authentication. Options:
### Bearer Token (Simple)
Generate a token and share it with edge clusters:
```bash
# Generate token
openssl rand -hex 32 > receive-token.txt

# Create secret in central cluster
kubectl create secret generic thanos-receive-auth \
  --from-file=token=receive-token.txt \
  -n monitoring

# Create secret in edge cluster
kubectl create secret generic thanos-remote-write-token \
  --from-file=token=receive-token.txt \
  -n monitoring
```
Note that Receive does not validate bearer tokens itself; enforce the check at the ingress or reverse proxy in front of it (for example, via nginx auth annotations). What Receive does understand is tenancy, configured with:

```yaml
# Add to Receive container args
- --receive.tenant-header=THANOS-TENANT
- --receive.default-tenant-id=anonymous
```
### mTLS (More Secure)
For environments requiring stronger authentication:
```yaml
# Receive args. The remote-write HTTP endpoint has its own TLS flags,
# separate from the --grpc-server-tls-* flags that secure the StoreAPI
- --remote-write.server-tls-cert=/etc/tls/server.crt
- --remote-write.server-tls-key=/etc/tls/server.key
- --remote-write.server-tls-client-ca=/etc/tls/ca.crt
```

```yaml
# Edge Prometheus remoteWrite
remoteWrite:
  - url: https://thanos-receive.example.com/api/v1/receive
    tlsConfig:
      certFile: /etc/prometheus/tls/client.crt
      keyFile: /etc/prometheus/tls/client.key
      caFile: /etc/prometheus/tls/ca.crt
```
## Multi-Tenancy
Thanos Receive supports multi-tenancy through the tenant header:
```yaml
# Edge cluster A
remoteWrite:
  - url: https://thanos-receive.example.com/api/v1/receive
    headers:
      THANOS-TENANT: tenant-a
```

```yaml
# Edge cluster B
remoteWrite:
  - url: https://thanos-receive.example.com/api/v1/receive
    headers:
      THANOS-TENANT: tenant-b
```
Data is stored with the tenant label, enabling isolation in queries:
```promql
# Query only tenant-a data
http_requests_total{tenant_id="tenant-a"}
```
This is useful when different teams or customers share the same Receive infrastructure.
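The header-to-label flow can be sketched as a toy model (not Thanos code; the header name and default value correspond to the `--receive.tenant-header` and `--receive.default-tenant-id` flags shown earlier):

```python
DEFAULT_TENANT = "anonymous"      # --receive.default-tenant-id
TENANT_HEADER = "THANOS-TENANT"   # --receive.tenant-header

def tenant_for_request(headers: dict) -> str:
    # Requests without the tenant header fall back to the default tenant,
    # so unlabeled writers still land somewhere queryable.
    return headers.get(TENANT_HEADER, DEFAULT_TENANT)

def store_series(labels: dict, headers: dict) -> dict:
    # Stored series gain a tenant_id label that queries can filter on.
    return {**labels, "tenant_id": tenant_for_request(headers)}
```

Because the label is attached at ingestion time, isolation holds even when two tenants push metrics with identical names and labels.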
## Handling Intermittent Connectivity
The killer feature for edge: graceful degradation when connectivity fails.
### Prometheus WAL Buffering
Prometheus buffers unsent samples in its Write-Ahead Log. Configure for your connectivity patterns:
```yaml
remoteWrite:
  - url: https://thanos-receive.example.com/api/v1/receive
    queueConfig:
      capacity: 100000          # Samples to buffer
      maxShards: 30             # Parallel senders when catching up
      maxSamplesPerSend: 5000
      batchSendDeadline: 60s
      minBackoff: 1s
      maxBackoff: 10m           # Don't hammer when down
```
For a typical edge cluster scraping every 30s:
- How long a 100k-sample buffer lasts depends on the active series count: roughly 3-4 hours for a few hundred series, far less at higher cardinality
- The hard ceiling is the WAL itself: Prometheus truncates it periodically, and samples that age out before being sent are lost
- Adjust based on your scrape interval and target count
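Sizing the buffer is simple arithmetic; this hypothetical helper makes the relationship explicit (numbers are illustrative):

```python
def buffer_duration_hours(capacity: int, active_series: int,
                          scrape_interval_s: float) -> float:
    """How long a remote_write sample buffer lasts during an outage.

    Samples accumulate at active_series / scrape_interval per second,
    so the buffer drains nothing and fills at that rate while offline.
    """
    samples_per_second = active_series / scrape_interval_s
    return capacity / samples_per_second / 3600

# 100k-sample capacity, 250 active series scraped every 30s -> ~3.3 hours
print(round(buffer_duration_hours(100_000, 250, 30), 1))
```

Run the numbers for your own series count before trusting any rule of thumb; cardinality varies wildly between clusters.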
### Monitor Remote Write Health
On the edge cluster, alert on remote write failures:
```yaml
- alert: PrometheusRemoteWriteFailing
  expr: |
    rate(prometheus_remote_storage_samples_failed_total[5m]) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Prometheus remote write failing"
    description: "Edge cluster {{ $labels.cluster }} failing to send metrics"
```
### Central Monitoring of Edge Health
On the central cluster, detect silent edges:
```yaml
- alert: EdgeClusterMissing
  # absent_over_time() over a regex match can't tell you *which* cluster
  # went silent, so track the age of each cluster's newest sample instead
  expr: |
    (time() - max by (cluster) (
      max_over_time(timestamp(up{job="prometheus", cluster=~"edge-.*"})[6h:])
    )) > 3600
  for: 30m
  labels:
    severity: critical
  annotations:
    summary: "Edge cluster not reporting"
    description: "No metrics from {{ $labels.cluster }} for 1 hour"
```
## Sidecar vs Receive: When to Use Which
| Aspect | Sidecar | Receive |
|---|---|---|
| Edge/offline support | Poor | Excellent |
| Real-time queries | Yes (via gRPC) | Yes (via Receive gRPC) |
| Resource at edge | Higher | Lower |
| Object storage access | Required at edge | Central only |
| Network direction | Outbound + inbound | Outbound only |
| Complexity at edge | Higher | Lower |
| Block upload latency | 2h (block size) | Real-time |
**Use Sidecar when:**
- All clusters have stable connectivity
- You want minimal central infrastructure
- Real-time queries across all clusters matter
**Use Receive when:**
- Edge clusters have intermittent connectivity
- You’re scaling to many (50+) clusters
- Edge resources are constrained
- Firewalls only allow outbound connections
**Hybrid approach:**
- Sidecar for core/stable clusters
- Receive for edge/remote clusters
- Both feed the same Querier and object storage
## My Production Setup
I run a hybrid model:
```yaml
# Central infrastructure cluster: Sidecar
# (stable connectivity, needs real-time queries)
prometheus:
  thanos:
    enabled: true
    objectStorageConfig:
      name: thanos-objstore
```

```yaml
# Edge manufacturing sites: Remote Write
# (intermittent connectivity, resource constrained)
prometheus:
  remoteWrite:
    - url: https://thanos-receive.central:443/api/v1/receive
      bearerTokenSecret:
        name: edge-token
        key: token
      queueConfig:
        capacity: 150000  # 6+ hours buffer
        maxBackoff: 15m
```

```yaml
# All feed the same Thanos Query
thanos:
  query:
    stores:
      - dnssrv+_grpc._tcp.prometheus-operated.monitoring.svc  # Sidecars
      - dnssrv+_grpc._tcp.thanos-receive.monitoring.svc       # Receive
      - dnssrv+_grpc._tcp.thanos-store.monitoring.svc         # Historical
```
Edge sites can lose connectivity for hours. When they reconnect, metrics catch up automatically. The central Grafana dashboards show the full picture regardless of which architecture fed the data.
## Why This Matters
Monitoring isn’t optional. You need to see what’s happening across all your clusters — not just the ones with perfect connectivity.
The push model with Thanos Receive gives you:
- Resilience: Edge clusters survive connectivity loss
- Simplicity: No sidecars, no object storage creds at edge
- Scalability: Centralize complexity, keep edges simple
- Flexibility: Mix sidecar and receive as needed
This is how you build observability that works in the real world, where networks fail and edge locations exist.
Pull works when everything is connected. Push works when reality intervenes. Good architectures handle both.
