In my previous post on Prometheus and Thanos, I covered the sidecar architecture — Thanos Sidecar runs alongside Prometheus, uploads TSDB blocks to object storage, and exposes data to the Querier. It works beautifully for clusters with stable connectivity to your central infrastructure.
But what happens when your clusters are at the edge? When they might lose connectivity for hours or days? When you’re running dozens or hundreds of small clusters and don’t want sidecar complexity on each one?
That’s where Thanos Receive comes in. Push instead of pull.
## The Problem with Sidecars at Scale
The sidecar model requires:
- Thanos Sidecar running on every Prometheus instance
- Direct object storage access from every cluster
- Stable network connectivity for block uploads
- gRPC connectivity to the central Querier for real-time queries
For a handful of clusters in the same datacenter, this is fine. But consider:
- Edge locations with intermittent internet connectivity
- Air-gapped environments that can only push outbound
- Hundreds of small clusters where sidecar overhead adds up
- Multi-tenant platforms where you don’t control all clusters
The sidecar model breaks down. You need something simpler.
## Enter Thanos Receive
Thanos Receive flips the model. Instead of sidecars pulling data out of Prometheus, Prometheus pushes metrics directly to a central Receive component.
```mermaid
flowchart TD
    subgraph edge["Edge Clusters"]
        subgraph E1["Site A (Factory)"]
            P1["Prometheus"]
        end
        subgraph E2["Site B (Warehouse)"]
            P2["Prometheus"]
        end
        subgraph E3["Site C (Retail)"]
            P3["Prometheus"]
        end
    end

    subgraph central["Central Cluster"]
        R["Thanos Receive<br/>(HA cluster)"]
        Q["Thanos Querier"]
        SG["Store Gateway"]
        C["Compactor"]
        OS["Object Storage"]
    end

    P1 -->|"remote_write"| R
    P2 -->|"remote_write"| R
    P3 -->|"remote_write"| R
    R --> OS
    OS --> SG
    SG --> Q
    R --> Q
    OS --> C
    Q --> G["Grafana"]
```
The key insight: Prometheus already has remote_write. It’s a built-in feature that pushes samples to any compatible endpoint. Thanos Receive implements this API.
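The HTTP contract is simple: Receive exposes an endpoint (here `/api/v1/receive`) that accepts snappy-compressed protobuf `WriteRequest` payloads over POST and answers with a 2xx on success. Here is a stdlib-only Python sketch of that contract; it acknowledges requests without decoding them (Thanos Receive, of course, does the real parsing):

```python
# Illustration only: the HTTP shape of a remote_write-compatible endpoint.
# A real receiver snappy-decompresses the body and parses the Prometheus
# WriteRequest protobuf; this sketch just records and acknowledges requests.
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

received = []  # raw request bodies, for inspection

class ReceiveHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/api/v1/receive":
            self.send_response(404)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        received.append(self.rfile.read(length))
        # Prometheus treats 2xx as success; 5xx triggers retry with backoff.
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

def serve(port: int = 19291) -> HTTPServer:
    server = HTTPServer(("127.0.0.1", port), ReceiveHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Because the sender side is built into Prometheus, anything that speaks this protocol can sit on the receiving end, which is exactly the seam Thanos Receive exploits.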
## Why Push is Better for Edge
### 1. Simpler Edge Deployment
No sidecar. No object storage credentials at the edge. No gRPC ports to expose. Just Prometheus with a remote_write URL.
```yaml
# The ENTIRE edge configuration
global:
  external_labels:
    cluster: factory-site-a
    region: europe-west

remote_write:
  - url: https://thanos-receive.central.example.com/api/v1/receive
    bearer_token_file: /etc/prometheus/token
```
That’s it. The edge cluster doesn’t need to know about object storage, Thanos components, or anything else. It just pushes metrics.
### 2. Tolerates Connectivity Loss
When connectivity drops, Prometheus buffers samples locally. When connectivity returns, it catches up. The Write-Ahead Log (WAL) handles this automatically.
```yaml
remote_write:
  - url: https://thanos-receive.central.example.com/api/v1/receive
    queue_config:
      capacity: 100000          # Buffer size
      max_shards: 50            # Parallel send workers
      min_shards: 1
      max_samples_per_send: 5000
      batch_send_deadline: 30s
      min_backoff: 1s
      max_backoff: 5m           # Exponential backoff on failure
```
I’ve seen edge clusters lose connectivity for 4+ hours and successfully catch up when the network returned. The WAL kept everything safe.
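The `min_backoff`/`max_backoff` settings describe a doubling retry schedule, which can be sketched as follows (this is just the shape, not Prometheus's exact implementation):

```python
def backoff_schedule(min_backoff: float = 1.0, max_backoff: float = 300.0,
                     attempts: int = 10) -> list[float]:
    """Delays (seconds) between successive failed sends: doubles from
    min_backoff until capped at max_backoff (the 1s and 5m above)."""
    delays, delay = [], min_backoff
    for _ in range(attempts):
        delays.append(delay)
        delay = min(delay * 2, max_backoff)
    return delays

# Early retries come quickly; a sustained outage settles at the cap,
# so a down endpoint is only probed every few minutes:
# backoff_schedule() -> [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 300.0]
```

The cap is what keeps a recovering Receive cluster from being hammered by every edge site at once.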
### 3. Outbound-Only Connectivity
Edge sites often have restrictive firewalls. Inbound connections? Blocked. VPNs? Nightmare to maintain. But outbound HTTPS to a known endpoint? Usually allowed.
Push architecture aligns with how edge networks actually work.
### 4. Lower Resource Usage at Edge
No sidecar means:
- Less memory on edge nodes
- No additional container to manage
- No gRPC connections to maintain
- Simpler failure modes
For resource-constrained edge devices, this matters.
## Setting Up Thanos Receive
### Central Cluster: Receive Deployment
Thanos Receive is stateful — it writes to disk before uploading to object storage. Run it as a StatefulSet:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-receive
  namespace: monitoring
spec:
  serviceName: thanos-receive
  replicas: 3
  selector:
    matchLabels:
      app: thanos-receive
  template:
    metadata:
      labels:
        app: thanos-receive
    spec:
      containers:
        - name: thanos-receive
          image: quay.io/thanos/thanos:v0.34.0
          args:
            - receive
            - --http-address=0.0.0.0:10902
            - --grpc-address=0.0.0.0:10901
            - --remote-write.address=0.0.0.0:19291
            - --tsdb.path=/var/thanos/receive
            - --objstore.config-file=/etc/thanos/objstore.yaml
            - --label=receive_replica="$(POD_NAME)"
            - --tsdb.retention=6h
            # Hashring for HA distribution
            - --receive.hashrings-file=/etc/thanos/hashring.json
            - --receive.local-endpoint=$(POD_NAME).thanos-receive.monitoring.svc:10901
          ports:
            - name: http
              containerPort: 10902
            - name: grpc
              containerPort: 10901
            - name: remote-write
              containerPort: 19291
          volumeMounts:
            - name: data
              mountPath: /var/thanos/receive
            - name: objstore-config
              mountPath: /etc/thanos
            # Mount the hashring ConfigMap (defined below) as a single file,
            # so it doesn't collide with the objstore secret at /etc/thanos
            - name: hashring-config
              mountPath: /etc/thanos/hashring.json
              subPath: hashring.json
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
      volumes:
        - name: objstore-config
          secret:
            secretName: thanos-objstore-secret
        - name: hashring-config
          configMap:
            name: thanos-receive-hashring
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        storageClassName: longhorn
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
```
### Hashring for High Availability
When running multiple Receive replicas, you need a hashring to distribute incoming data:
```json
[
  {
    "endpoints": [
      "thanos-receive-0.thanos-receive.monitoring.svc:10901",
      "thanos-receive-1.thanos-receive.monitoring.svc:10901",
      "thanos-receive-2.thanos-receive.monitoring.svc:10901"
    ]
  }
]
```
Store this in a ConfigMap:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-receive-hashring
  namespace: monitoring
data:
  hashring.json: |
    [
      {
        "endpoints": [
          "thanos-receive-0.thanos-receive.monitoring.svc:10901",
          "thanos-receive-1.thanos-receive.monitoring.svc:10901",
          "thanos-receive-2.thanos-receive.monitoring.svc:10901"
        ]
      }
    ]
```
The hashring ensures samples from the same time series always go to the same Receive instance, preventing duplicates.
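To make the "same series, same instance" property concrete, here is a stdlib-only sketch of hashmod-style routing. Thanos's default hashring hashes the tenant plus the series labelset with xxhash; sha256 stands in here so the example runs without dependencies, and the behavior (deterministic assignment) is what matters:

```python
import hashlib

ENDPOINTS = [
    "thanos-receive-0.thanos-receive.monitoring.svc:10901",
    "thanos-receive-1.thanos-receive.monitoring.svc:10901",
    "thanos-receive-2.thanos-receive.monitoring.svc:10901",
]

def route(tenant: str, labels: dict, endpoints: list = ENDPOINTS) -> str:
    # Hash the tenant plus the sorted labelset so an identical series
    # always lands on the same replica, regardless of which node first
    # accepts the request and forwards it around the ring.
    key = tenant + "".join(f"{k}={v};" for k, v in sorted(labels.items()))
    digest = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return endpoints[digest % len(endpoints)]
```

Note that resizing the endpoint list remaps most series under plain hashmod, which is why changing replica counts causes a brief ingestion wobble.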
### Load Balancer Service
Expose Receive for external clusters, either directly through a LoadBalancer or behind a TLS-terminating Ingress:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: thanos-receive-lb
  namespace: monitoring
spec:
  type: LoadBalancer
  ports:
    - name: remote-write
      port: 443
      targetPort: 19291
  selector:
    app: thanos-receive
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: thanos-receive
  namespace: monitoring
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
spec:
  tls:
    - hosts:
        - thanos-receive.example.com
      secretName: thanos-receive-tls
  rules:
    - host: thanos-receive.example.com
      http:
        paths:
          - path: /api/v1/receive
            pathType: Prefix
            backend:
              service:
                name: thanos-receive
                port:
                  number: 19291
```
### Edge Cluster: Prometheus Configuration
On each edge cluster, configure Prometheus to push metrics:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  retention: 24h  # Local retention for edge queries
  externalLabels:
    cluster: factory-site-a
    region: europe-west
    environment: production
  remoteWrite:
    - url: https://thanos-receive.example.com/api/v1/receive
      bearerTokenSecret:
        name: thanos-remote-write-token
        key: token
      queueConfig:
        capacity: 50000
        maxShards: 20
        minShards: 1
        maxSamplesPerSend: 2000
        batchSendDeadline: 30s
        minBackoff: 1s
        maxBackoff: 5m
      writeRelabelConfigs:
        # Drop high-cardinality metrics you don't need centrally
        - sourceLabels: [__name__]
          regex: 'go_.*'
          action: drop
```
**Important:** Set meaningful `externalLabels`. These identify which cluster the metrics came from. Without them, you can’t distinguish data from different sites.
## Authentication and Security
Never expose Thanos Receive without authentication. Options:
### Bearer Token (Simple)
Generate a token and share it with edge clusters:
```bash
# Generate token
openssl rand -hex 32 > receive-token.txt

# Create secret in central cluster
kubectl create secret generic thanos-receive-auth \
  --from-file=token=receive-token.txt \
  -n monitoring

# Create secret in edge cluster
kubectl create secret generic thanos-remote-write-token \
  --from-file=token=receive-token.txt \
  -n monitoring
```
Note that Receive does not validate bearer tokens itself; enforce the check at the ingress or reverse proxy in front of it (for example, via nginx auth annotations). What Receive does understand is tenancy, configured with:

```yaml
# Add to Receive container args
- --receive.tenant-header=THANOS-TENANT
- --receive.default-tenant-id=anonymous
```
### mTLS (More Secure)
For environments requiring stronger authentication:
```yaml
# Receive args. The remote-write HTTP endpoint has its own TLS flags,
# separate from the --grpc-server-tls-* flags that secure the StoreAPI
- --remote-write.server-tls-cert=/etc/tls/server.crt
- --remote-write.server-tls-key=/etc/tls/server.key
- --remote-write.server-tls-client-ca=/etc/tls/ca.crt
```

```yaml
# Edge Prometheus remoteWrite
remoteWrite:
  - url: https://thanos-receive.example.com/api/v1/receive
    tlsConfig:
      certFile: /etc/prometheus/tls/client.crt
      keyFile: /etc/prometheus/tls/client.key
      caFile: /etc/prometheus/tls/ca.crt
```
## Multi-Tenancy
Thanos Receive supports multi-tenancy through the tenant header:
```yaml
# Edge cluster A
remoteWrite:
  - url: https://thanos-receive.example.com/api/v1/receive
    headers:
      THANOS-TENANT: tenant-a
```

```yaml
# Edge cluster B
remoteWrite:
  - url: https://thanos-receive.example.com/api/v1/receive
    headers:
      THANOS-TENANT: tenant-b
```
Data is stored with the tenant label, enabling isolation in queries:
```promql
# Query only tenant-a data
http_requests_total{tenant_id="tenant-a"}
```
This is useful when different teams or customers share the same Receive infrastructure.
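The header-to-label flow can be sketched as a toy model (not Thanos code; the header name and default value correspond to the `--receive.tenant-header` and `--receive.default-tenant-id` flags shown earlier):

```python
DEFAULT_TENANT = "anonymous"      # --receive.default-tenant-id
TENANT_HEADER = "THANOS-TENANT"   # --receive.tenant-header

def tenant_for_request(headers: dict) -> str:
    # Requests without the tenant header fall back to the default tenant,
    # so unlabeled writers still land somewhere queryable.
    return headers.get(TENANT_HEADER, DEFAULT_TENANT)

def store_series(labels: dict, headers: dict) -> dict:
    # Stored series gain a tenant_id label that queries can filter on.
    return {**labels, "tenant_id": tenant_for_request(headers)}
```

Because the label is attached at ingestion time, isolation holds even when two tenants push metrics with identical names and labels.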
## Handling Intermittent Connectivity
The killer feature for edge: graceful degradation when connectivity fails.
### Prometheus WAL Buffering
Prometheus buffers unsent samples in its Write-Ahead Log. Configure for your connectivity patterns:
```yaml
remoteWrite:
  - url: https://thanos-receive.example.com/api/v1/receive
    queueConfig:
      capacity: 100000          # Samples to buffer
      maxShards: 30             # Parallel senders when catching up
      maxSamplesPerSend: 5000
      batchSendDeadline: 60s
      minBackoff: 1s
      maxBackoff: 10m           # Don't hammer when down
```
For a typical edge cluster scraping every 30s:
- How long a 100k-sample buffer lasts depends on the active series count: roughly 3-4 hours for a few hundred series, far less at higher cardinality
- The hard ceiling is the WAL itself: Prometheus truncates it periodically, and samples that age out before being sent are lost
- Adjust based on your scrape interval and target count
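Sizing the buffer is simple arithmetic; this hypothetical helper makes the relationship explicit (numbers are illustrative):

```python
def buffer_duration_hours(capacity: int, active_series: int,
                          scrape_interval_s: float) -> float:
    """How long a remote_write sample buffer lasts during an outage.

    Samples accumulate at active_series / scrape_interval per second,
    so the buffer drains nothing and fills at that rate while offline.
    """
    samples_per_second = active_series / scrape_interval_s
    return capacity / samples_per_second / 3600

# 100k-sample capacity, 250 active series scraped every 30s -> ~3.3 hours
print(round(buffer_duration_hours(100_000, 250, 30), 1))
```

Run the numbers for your own series count before trusting any rule of thumb; cardinality varies wildly between clusters.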
### Monitor Remote Write Health
On the edge cluster, alert on remote write failures:
```yaml
- alert: PrometheusRemoteWriteFailing
  expr: |
    rate(prometheus_remote_storage_samples_failed_total[5m]) > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Prometheus remote write failing"
    description: "Edge cluster {{ $labels.cluster }} failing to send metrics"
```
### Central Monitoring of Edge Health
On the central cluster, detect silent edges:
```yaml
- alert: EdgeClusterMissing
  # absent_over_time() over a regex match can't tell you *which* cluster
  # went silent, so track the age of each cluster's newest sample instead
  expr: |
    (time() - max by (cluster) (
      max_over_time(timestamp(up{job="prometheus", cluster=~"edge-.*"})[6h:])
    )) > 3600
  for: 30m
  labels:
    severity: critical
  annotations:
    summary: "Edge cluster not reporting"
    description: "No metrics from {{ $labels.cluster }} for 1 hour"
```
## Sidecar vs Receive: When to Use Which
| Aspect | Sidecar | Receive |
|---|---|---|
| Edge/offline support | Poor | Excellent |
| Real-time queries | Yes (via gRPC) | Yes (via Receive gRPC) |
| Resource at edge | Higher | Lower |
| Object storage access | Required at edge | Central only |
| Network direction | Outbound + inbound | Outbound only |
| Complexity at edge | Higher | Lower |
| Block upload latency | 2h (block size) | Real-time |
**Use Sidecar when:**
- All clusters have stable connectivity
- You want minimal central infrastructure
- Real-time queries across all clusters matter
**Use Receive when:**
- Edge clusters have intermittent connectivity
- You’re scaling to many (50+) clusters
- Edge resources are constrained
- Firewalls only allow outbound connections
**Hybrid approach:**
- Sidecar for core/stable clusters
- Receive for edge/remote clusters
- Both feed the same Querier and object storage
## My Production Setup
I run a hybrid model:
```yaml
# Central infrastructure cluster: Sidecar
# (stable connectivity, needs real-time queries)
prometheus:
  thanos:
    enabled: true
    objectStorageConfig:
      name: thanos-objstore
```

```yaml
# Edge manufacturing sites: Remote Write
# (intermittent connectivity, resource constrained)
prometheus:
  remoteWrite:
    - url: https://thanos-receive.central:443/api/v1/receive
      bearerTokenSecret:
        name: edge-token
        key: token
      queueConfig:
        capacity: 150000  # 6+ hours buffer
        maxBackoff: 15m
```

```yaml
# All feed the same Thanos Query
thanos:
  query:
    stores:
      - dnssrv+_grpc._tcp.prometheus-operated.monitoring.svc  # Sidecars
      - dnssrv+_grpc._tcp.thanos-receive.monitoring.svc       # Receive
      - dnssrv+_grpc._tcp.thanos-store.monitoring.svc         # Historical
```
Edge sites can lose connectivity for hours. When they reconnect, metrics catch up automatically. The central Grafana dashboards show the full picture regardless of which architecture fed the data.
## Why This Matters
Monitoring isn’t optional. You need to see what’s happening across all your clusters — not just the ones with perfect connectivity.
The push model with Thanos Receive gives you:
- Resilience: Edge clusters survive connectivity loss
- Simplicity: No sidecars, no object storage creds at edge
- Scalability: Centralize complexity, keep edges simple
- Flexibility: Mix sidecar and receive as needed
This is how you build observability that works in the real world, where networks fail and edge locations exist.
Pull works when everything is connected. Push works when reality intervenes. Good architectures handle both.
