You’ve got Prometheus for metrics. You can see what’s happening across your clusters. But when something breaks, metrics tell you that something is wrong — logs tell you why.

The traditional answer is Elasticsearch. It’s powerful, flexible, and… expensive. It indexes everything, which means you pay for every byte of log data in CPU, memory, and storage.

Loki takes a different approach: index labels, not content. It’s the same philosophy that makes Prometheus efficient for metrics, applied to logs.

Why Loki?

Loki was designed by Grafana Labs with specific goals:

  1. Cost efficient — Only index metadata (labels), store log lines compressed
  2. Kubernetes native — Labels from Kubernetes metadata automatically
  3. Grafana integration — Same dashboards, same alerting, same workflow
  4. Operationally simple — No JVM tuning, no cluster management complexity

The trade-off: you can’t do full-text search across all logs efficiently. You need to know which labels to filter on first. For Kubernetes workloads where you’re usually looking at specific pods, namespaces, or services, this is fine.

Architecture

flowchart TD
    subgraph cluster["Kubernetes Cluster"]
        subgraph nodes["Nodes"]
            P1["Promtail<br/>(DaemonSet)"]
            P2["Promtail"]
            P3["Promtail"]
        end
        PODS["Pod Logs<br/>(/var/log/pods)"]
    end

    P1 --> PODS
    P2 --> PODS
    P3 --> PODS

    P1 --> L["Loki"]
    P2 --> L
    P3 --> L

    L --> OS["Object Storage<br/>(chunks)"]
    L --> G["Grafana"]

Promtail runs on each node as a DaemonSet, tails log files, adds labels, and ships to Loki.

Loki receives logs, indexes by labels, compresses chunks, stores in object storage.

Grafana queries Loki using LogQL, displays in Explore or dashboards.
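That handoff from Promtail to Loki is plain HTTP: the push endpoint (/loki/api/v1/push) accepts a JSON body of streams, each a label set plus (nanosecond-timestamp, line) pairs. A minimal Python sketch of the payload shape — the labels and lines here are invented for illustration:

```python
import json
import time

def build_push_payload(labels: dict, lines: list) -> dict:
    """Build a Loki push-API body: one stream, identified by its label set,
    carrying [nanosecond-timestamp, line] value pairs."""
    now_ns = str(time.time_ns())
    return {
        "streams": [{
            "stream": labels,  # the indexed part
            "values": [[now_ns, line] for line in lines],  # the raw, unindexed part
        }]
    }

payload = build_push_payload(
    {"namespace": "production", "app": "frontend"},  # hypothetical labels
    ["GET /healthz 200", "GET /orders 500"],         # hypothetical log lines
)
# json.dumps(payload) is what gets POSTed to /loki/api/v1/push
```

Nothing in the line itself is indexed — only the label set — which is exactly the design trade-off described above.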

Installing Loki

Using the Grafana Helm charts:

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Loki, with the deployment mode chosen in loki-values.yaml
helm install loki grafana/loki \
  --namespace monitoring \
  --create-namespace \
  --values loki-values.yaml

Basic values for a single-binary deployment (good for small clusters):

# loki-values.yaml
loki:
  auth_enabled: false

  commonConfig:
    replication_factor: 1

  storage:
    type: filesystem

  schemaConfig:
    configs:
      - from: 2024-01-01
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: index_
          period: 24h

singleBinary:
  replicas: 1
  persistence:
    size: 50Gi

# Disable components not needed for single binary
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0

For production with object storage:

# loki-values.yaml
loki:
  auth_enabled: false

  commonConfig:
    replication_factor: 3

  storage:
    type: s3
    s3:
      endpoint: minio.storage:9000
      bucketnames: loki-chunks
      # ${VAR} expansion requires running Loki with -config.expand-env=true
      access_key_id: ${MINIO_ACCESS_KEY}
      secret_access_key: ${MINIO_SECRET_KEY}
      insecure: true

  schemaConfig:
    configs:
      - from: 2024-01-01
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: index_
          period: 24h

# Scalable deployment
backend:
  replicas: 3
read:
  replicas: 3
write:
  replicas: 3

Installing Promtail

Promtail collects logs from your nodes:

helm install promtail grafana/promtail \
  --namespace monitoring \
  --set config.clients[0].url=http://loki:3100/loki/api/v1/push

Promtail configuration for Kubernetes:

# promtail-values.yaml
config:
  clients:
    - url: http://loki:3100/loki/api/v1/push

  snippets:
    # Add Kubernetes metadata as labels
    pipelineStages:
      - cri: {}
      - labeldrop:
          - filename
      - match:
          selector: '{app="nginx"}'
          stages:
            - regex:
                expression: '^(?P<remote_addr>[\d\.]+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] "(?P<request>[^"]+)" (?P<status>\d+)'
            - labels:
                status:

# DaemonSet tolerations for all nodes
tolerations:
  - operator: Exists
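The regex stage uses Go's RE2 named-group syntax, which overlaps enough with Python's that you can sanity-check an expression locally before shipping it. A quick sketch with the same expression and an invented access-log line:

```python
import re

# Same expression as the Promtail regex stage above
NGINX_RE = re.compile(
    r'^(?P<remote_addr>[\d\.]+) - (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] "(?P<request>[^"]+)" (?P<status>\d+)'
)

# Invented sample line for illustration
line = '192.168.1.10 - - [10/Oct/2024:13:55:36 +0000] "GET /missing HTTP/1.1" 404'
m = NGINX_RE.match(line)
assert m is not None
print(m.group("status"), m.group("request"))  # → 404 GET /missing HTTP/1.1
```

One caveat: RE2 has no backreferences or lookarounds, so a pattern that works in Python is not guaranteed to work in Promtail — but simple named groups like these behave identically.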

Understanding Labels

Labels are everything in Loki. They determine how logs are indexed and queried.

Default Kubernetes labels from Promtail:

Label        Source           Example
-----        ------           -------
namespace    Pod namespace    default, monitoring
pod          Pod name         nginx-abc123
container    Container name   nginx, sidecar
node_name    Node             worker-1
app          Pod label        nginx
job          Scrape config    kubernetes-pods

High cardinality warning: Don’t add labels that have many unique values (like request IDs, user IDs, or timestamps). This kills Loki’s performance. Keep labels to dimensions you’ll filter on.

Good labels:

  • namespace, app, environment, team

Bad labels:

  • request_id, user_id, trace_id, timestamp
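Why cardinality matters: every unique combination of label values becomes a separate stream with its own index entries and chunks, so stream count grows multiplicatively. A back-of-envelope Python sketch with illustrative counts:

```python
from math import prod

# Illustrative unique-value counts per label
good_labels = {"namespace": 20, "app": 50, "environment": 3}
bad_labels = {**good_labels, "user_id": 10_000}  # one high-cardinality label

def max_streams(cardinalities: dict) -> int:
    """Worst-case stream count: the product of per-label cardinalities."""
    return prod(cardinalities.values())

print(max_streams(good_labels))  # → 3000
print(max_streams(bad_labels))   # → 30000000
```

One user-ID label turns three thousand streams into thirty million. Keep IDs in the log line, where filters like |= can still find them.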

LogQL: Querying Logs

LogQL is Loki’s query language. It looks like PromQL but for logs.

Basic Queries

# All logs from a namespace
{namespace="production"}

# Specific app
{app="frontend", namespace="production"}

# Multiple containers
{container=~"nginx|envoy"}

# Exclude a namespace
{namespace!="kube-system"}

Filtering Content

# Lines containing "error"
{app="frontend"} |= "error"

# Lines NOT containing "health"
{app="frontend"} != "health"

# Regex match
{app="frontend"} |~ "status=(4|5)[0-9]{2}"

# Case insensitive
{app="frontend"} |~ "(?i)error"

Parsing and Extracting

# Parse JSON logs
{app="api"} | json

# Extract specific field
{app="api"} | json | status_code >= 500

# Parse with pattern
{app="nginx"} | pattern `<ip> - - [<_>] "<method> <path> <_>" <status>`

# Use extracted fields
{app="nginx"} | pattern `<_> - - [<_>] "<method> <path> <_>" <status>` | status >= 400

Aggregations (Log Metrics)

# Count errors per app
sum by (app) (count_over_time({namespace="production"} |= "error" [5m]))

# Rate of requests
sum(rate({app="nginx"} | pattern `<_> "<method> <path> <_>" <status>` [1m])) by (status)

# Bytes per namespace
sum by (namespace) (bytes_over_time({job="kubernetes-pods"}[1h]))
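Under the hood, count_over_time is just a windowed count of matching lines. A toy Python model of that semantics, over invented (timestamp, line) records:

```python
from datetime import datetime, timedelta

# Invented log records: (timestamp, line)
logs = [
    (datetime(2024, 10, 10, 13, 1), "error: payment declined"),
    (datetime(2024, 10, 10, 13, 2), "request ok"),
    (datetime(2024, 10, 10, 13, 3), "error: timeout"),
    (datetime(2024, 10, 10, 13, 9), "error: timeout"),
]

def count_over_time(records, needle, at, window):
    """Count lines containing `needle` with timestamps in (at - window, at]."""
    return sum(1 for ts, line in records if at - window < ts <= at and needle in line)

at = datetime(2024, 10, 10, 13, 5)
print(count_over_time(logs, "error", at, timedelta(minutes=5)))  # → 2
```

The 13:09 error falls outside the 5-minute window ending at 13:05, so only two lines count — the same sliding-window logic Loki evaluates at each step of a range query.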

Grafana Integration

Add Loki as a data source in Grafana:

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  loki.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        url: http://loki:3100
        access: proxy
        jsonData:
          maxLines: 1000

Explore View

Grafana Explore is perfect for log investigation:

  1. Select Loki data source
  2. Build query with label browser
  3. Filter with content matches
  4. Click on log lines for context

Dashboard Panels

Add logs to your dashboards:

{
  "type": "logs",
  "datasource": "Loki",
  "targets": [
    {
      "expr": "{namespace=\"production\", app=\"frontend\"} |= \"error\"",
      "refId": "A"
    }
  ],
  "options": {
    "showTime": true,
    "showLabels": false,
    "wrapLogMessage": true
  }
}

Correlating Metrics and Logs

The power of Grafana: same dashboard shows metrics and logs.

# Prometheus panel showing error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) by (app)

# Loki panel showing error logs
{app="$app"} |= "error"

The $app variable links both panels. Click a spike in the metrics, see the errors in the logs.

Alerting on Logs

Loki supports alerting through Grafana (build the rule against the Loki data source in the UI) or through its own ruler component, which evaluates Prometheus-style rule groups written in LogQL:

# loki-alert-rules.yaml
groups:
  - name: LogAlerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(count_over_time({namespace="production"} |= "error" [5m])) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in production logs"

Using Loki’s ruler component:

# loki-rules.yaml
groups:
  - name: errors
    rules:
      - alert: CriticalError
        expr: |
          count_over_time({app="payment-service"} |= "CRITICAL" [1m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Critical error in payment service"

Retention and Storage

Configure retention in Loki:

loki:
  limits_config:
    retention_period: 30d

  compactor:
    working_directory: /var/loki/compactor
    retention_enabled: true
    retention_delete_delay: 2h
    retention_delete_worker_count: 150

Per-tenant retention (if using multi-tenancy):

loki:
  limits_config:
    retention_period: 30d  # Default

  overrides:
    production:
      retention_period: 90d  # Keep production logs longer
    development:
      retention_period: 7d   # Dev logs expire faster

Performance Tuning

Chunk Size

Larger chunks mean fewer index entries and better compression, at the cost of higher latency for small queries:

loki:
  ingester:
    chunk_target_size: 1572864  # 1.5MB
    chunk_idle_period: 30m
    max_chunk_age: 2h
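A rough sizing check for these settings: ignoring compression and idle/age-based flushes, a stream's daily chunk count is its daily volume divided by chunk_target_size, rounded up. Illustrative arithmetic in Python:

```python
CHUNK_TARGET_SIZE = 1_572_864  # 1.5 MiB, as configured above

def chunks_per_day(bytes_per_day: int, target: int = CHUNK_TARGET_SIZE) -> int:
    """Worst-case chunk flushes per stream per day (ceiling division)."""
    return -(-bytes_per_day // target)

# A chatty stream writing ~100 MiB of logs per day
print(chunks_per_day(100 * 1024 * 1024))  # → 67
```

Multiply by your stream count to estimate object-storage write volume; if most streams are quiet, chunk_idle_period flushes will dominate instead.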

Query Limits

Prevent runaway queries:

loki:
  limits_config:
    max_query_length: 721h           # Max time range
    max_query_parallelism: 32        # Concurrent sub-queries
    max_entries_limit_per_query: 5000

Caching

Add caching for better query performance:

loki:
  memcached:
    chunk_cache:
      enabled: true
      host: memcached.monitoring
    results_cache:
      enabled: true
      host: memcached.monitoring

Structured Logging Best Practices

To get the most from Loki, log in structured format:

{
  "level": "error",
  "message": "Failed to process order",
  "order_id": "12345",
  "error": "payment declined",
  "duration_ms": 234
}

Query structured logs easily:

{app="order-service"} | json | level="error" | duration_ms > 1000

Configure your apps to output JSON:

# Spring Boot
logging.pattern.console: '{"timestamp":"%d","level":"%p","logger":"%c","message":"%m"}%n'

# Node.js with winston
const logger = winston.createLogger({
  format: winston.format.json(),
});

# Go with zap
logger, _ := zap.NewProduction()
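Python apps can do the same with just the standard library. A hypothetical JSON-formatter sketch — the field names mirror the example above, not any particular library's API:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line that Loki can parse with `| json`."""
    def format(self, record):
        payload = {"level": record.levelname.lower(), "message": record.getMessage()}
        payload.update(getattr(record, "fields", {}))  # extra structured fields, if any
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.error("Failed to process order",
             extra={"fields": {"order_id": "12345", "duration_ms": 234}})
# emits: {"level": "error", "message": "Failed to process order", ...}
```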

My Production Setup

# Loki with object storage
loki:
  auth_enabled: false
  commonConfig:
    replication_factor: 3
  storage:
    type: s3
    s3:
      endpoint: minio.storage:9000
      bucketnames: loki-data
  limits_config:
    retention_period: 30d
    ingestion_rate_mb: 10
    ingestion_burst_size_mb: 20

# Three-replica deployment
backend:
  replicas: 3
  persistence:
    size: 50Gi
read:
  replicas: 3
write:
  replicas: 3
  persistence:
    size: 50Gi

# Promtail on all nodes
promtail:
  tolerations:
    - operator: Exists
  config:
    clients:
      - url: http://loki:3100/loki/api/v1/push

Key decisions:

  • Object storage: MinIO for sovereignty, no cloud dependency
  • 30-day retention: Enough for debugging, not infinite
  • Three replicas: Survives node failures
  • All nodes: Promtail runs everywhere, including control plane

Loki vs Elasticsearch

Aspect                 Loki           Elasticsearch
------                 ----           -------------
Indexing               Labels only    Full-text
Storage cost           Lower          Higher
Query flexibility      Label-first    Full-text search
Operations             Simpler        Complex
Memory usage           Lower          Higher (JVM)
Grafana integration    Native         Good

Choose Loki when:

  • You query by known dimensions (namespace, app, pod)
  • Cost matters
  • You want operational simplicity
  • You’re already in the Grafana ecosystem

Choose Elasticsearch when:

  • You need full-text search across all logs
  • You don’t know what you’re looking for
  • Log analytics is a primary use case

Why This Matters

Logs are the narrative of your system. Metrics tell you the health score; logs tell you the story.

With Loki, you get:

  • Affordable log retention — Keep logs without breaking the budget
  • Kubernetes-native labels — Query by what matters (namespace, app, pod)
  • Unified observability — Same Grafana, same workflow as metrics

Combined with Prometheus/Thanos for metrics and traces (covered separately), you have complete observability without the operational complexity of the ELK stack.


Logs are the story your system tells about itself. Loki makes sure you can afford to listen.