Your cluster is gone. Complete failure. The cloud region is down, the hardware died, or someone ran the wrong terraform destroy. Everything is gone.

Now what?

If you’ve been doing GitOps right, the answer is: spin up a new cluster, point ArgoCD at Git, wait. Your entire infrastructure recreates itself.

This is the ultimate promise of GitOps: Git is your backup.

Why GitOps Changes Disaster Recovery

Traditional DR involves:

  • Regular backups of cluster state
  • Backup storage (etcd snapshots, Velero backups)
  • Tested restore procedures
  • Recovery time measured in hours

GitOps DR is different:

  • Desired cluster state lives in Git, so Git IS the backup
  • No separate backup infrastructure needed for cluster configuration (data still needs its own, see below)
  • Recovery = bootstrap + sync
  • Recovery time measured in minutes

The difference comes from treating cluster state as derived, not primary. The primary data is in Git. Clusters are disposable.

flowchart TD
    subgraph traditional["Traditional"]
        TC["Cluster<br/>(primary)"] -->|backup| TB["Backup<br/>(secondary)"]
    end

    subgraph gitops["GitOps"]
        GG["Git<br/>(primary)"] -->|sync| GC["Cluster<br/>(derived)"]
    end

What You Need for GitOps DR

For this to work, everything must be in Git:

1. Application Manifests

All your workload definitions:

  • Deployments, Services, ConfigMaps
  • Helm charts with values files
  • Kustomize overlays

This is the easy part. Most teams already have this.

2. Infrastructure Configuration

Cluster-level resources:

  • Namespaces
  • RBAC (Roles, RoleBindings, ServiceAccounts)
  • NetworkPolicies
  • ResourceQuotas
  • Pod Security Standards (enforced via Pod Security Admission; PodSecurityPolicy was removed in Kubernetes 1.25)

Often forgotten, but critical for security and isolation.
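
For the Pod Security Standards item, the enforcement level is set with labels on the Namespace itself, which makes it easy to keep in Git. A minimal sketch of a helper that renders such a manifest; render_namespace and the team-a namespace are hypothetical names, while the label keys are the standard pod-security.kubernetes.io ones:

```shell
# Hypothetical helper: print a Namespace manifest with Pod Security Admission labels.
# Usage: render_namespace <name> <level>   (level: privileged | baseline | restricted)
render_namespace() {
  ns_name=$1
  ns_level=$2
  case "$ns_level" in
    privileged|baseline|restricted) ;;
    *) echo "unknown Pod Security level: $ns_level" >&2; return 1 ;;
  esac
  cat <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: ${ns_name}
  labels:
    pod-security.kubernetes.io/enforce: ${ns_level}
    pod-security.kubernetes.io/warn: ${ns_level}
EOF
}

# Example: render_namespace team-a restricted > namespaces/team-a.yaml
```

The output is meant to be committed, not applied by hand, so the namespace and its security level come back with everything else during a sync.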

3. Platform Components

The tools your applications depend on:

  • Ingress controllers
  • Cert-manager
  • Monitoring stack (Prometheus, Grafana)
  • Logging (Loki, Fluentd)
  • Service mesh (if used)

4. Secrets Strategy

Secrets are special — you can’t store plain secrets in Git. Options:

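  • Sealed Secrets: Encrypted in Git, decrypted in cluster (the controller's private key lives in the cluster, so back it up separately)
  • External Secrets: Fetch from Vault/AWS Secrets Manager
  • SOPS: Encrypt files with keys you control (and keep those keys somewhere that survives the cluster)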

The important thing: secrets must be reproducible, not stored only in the cluster.
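
If you go the Sealed Secrets route, the encrypted files in Git are only useful to a cluster that holds the matching private key. A sketch of exporting that key ahead of time, assuming the controller runs in its default kube-system namespace; backup_sealing_key is a hypothetical helper name:

```shell
# Sketch: export the Sealed Secrets sealing key(s) so a replacement cluster can
# decrypt what is already committed to Git. Assumes the controller's default
# namespace (kube-system). Store the output somewhere safe, never in Git.
backup_sealing_key() {
  out_file=$1
  ns=${2:-kube-system}
  (
    umask 077  # the file contains private key material
    kubectl get secret -n "$ns" \
      -l sealedsecrets.bitnami.com/sealed-secrets-key \
      -o yaml > "$out_file"
  )
}

# On the replacement cluster, apply the backup before the controller starts
# (or restart the controller afterwards):
#   kubectl apply -f sealing-key-backup.yaml
```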

5. Persistent Data

Here’s the hard truth: GitOps doesn’t back up data.

Databases, file storage, stateful workloads — these need separate backup strategies. GitOps recreates the infrastructure that runs your data, not the data itself.

The Recovery Process

Let’s walk through recovering a complete cluster failure.

Step 0: Have a Recovery Playbook

Before disaster strikes, document:

  • How to provision a new cluster
  • Bootstrap commands for ArgoCD
  • Vault access and backup location
  • External dependencies (DNS, load balancers)

Test this regularly. A playbook you’ve never run is fiction.
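
One cheap thing a playbook can automate is checking that the machine you are recovering from has the tools the later steps call. A minimal sketch; require_tools is a hypothetical helper, and the tool list in the example is just the set of CLIs used in this post:

```shell
# Hypothetical pre-flight check: fail early if a required CLI is missing.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: require_tools terraform kubectl argocd velero || exit 1
```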

Step 1: Provision New Cluster

Using whatever method you prefer:

# Terraform
cd terraform/clusters/production
terraform apply

# Or managed Kubernetes
gcloud container clusters create production-recovery \
  --region europe-west1 \
  --num-nodes 3

# Or bare metal (Talos; --insecure is needed for a fresh node still in maintenance mode)
talosctl apply-config --insecure --nodes 10.0.0.10 --file controlplane.yaml

You need a functional Kubernetes cluster. How you get there is separate from GitOps.

Step 2: Install ArgoCD

kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Wait for it to be ready
kubectl wait --for=condition=Ready pods --all -n argocd --timeout=300s
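
The stable URL above moves over time. For a reproducible recovery you may prefer to install the same ArgoCD release the old cluster ran. A sketch, assuming you record that release tag in your playbook; argocd_install_url and the ARGOCD_VERSION variable are hypothetical names:

```shell
# Sketch: build the install manifest URL for a specific ArgoCD release tag.
argocd_install_url() {
  version=$1
  echo "https://raw.githubusercontent.com/argoproj/argo-cd/${version}/manifests/install.yaml"
}

# Example (ARGOCD_VERSION holds the tag noted in your playbook):
#   kubectl apply -n argocd -f "$(argocd_install_url "$ARGOCD_VERSION")"
```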

Step 3: Restore Secrets Access

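Configure External Secrets to connect to Vault. On a fresh cluster the external-secrets namespace does not exist yet, so create it first:

# Create the namespace the External Secrets operator will be deployed into
kubectl create namespace external-secrets

# Configure Vault authentication for the new cluster
kubectl create secret generic vault-token \
  -n external-secrets \
  --from-literal=token="$VAULT_TOKEN"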

Step 4: Connect ArgoCD to Git

# Add your GitOps repository (the argocd CLI must be logged in first: argocd login)
argocd repo add https://github.com/yourorg/gitops-repo.git \
  --username git \
  --password "$GIT_TOKEN"

# Or via Secret
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: gitops-repo
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  url: https://github.com/yourorg/gitops-repo.git
  username: git
  password: ${GIT_TOKEN}
EOF

Step 5: Apply the Root Application

If you use the App-of-Apps pattern (you should):

kubectl apply -f - <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/yourorg/gitops-repo.git
    targetRevision: main
    path: clusters/production
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
EOF

Step 6: Wait

ArgoCD now reconciles everything:

  1. Discovers all applications in the root app
  2. Creates namespaces and RBAC
  3. Deploys platform components
  4. Deploys your applications

Watch progress:

argocd app list
argocd app get root --refresh

Or in the UI: run kubectl port-forward svc/argocd-server -n argocd 8080:443 and open https://localhost:8080
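
If you would rather have the playbook block until the sync is done, the CLI can wait for you. A sketch, assuming the root app is named root as above; wait_for_all_apps is a hypothetical helper. It does a single pass over the app list, so with nested App-of-Apps you may need to run it again:

```shell
# Sketch: wait until the root app and every Application it created are
# synced and healthy. Timeout is in seconds per app (default 30 minutes).
wait_for_all_apps() {
  timeout_seconds=${1:-1800}
  # First the root app, so that the child Applications exist...
  argocd app wait root --sync --health --timeout "$timeout_seconds" || return 1
  # ...then every Application ArgoCD currently knows about.
  for app in $(argocd app list -o name); do
    argocd app wait "$app" --sync --health --timeout "$timeout_seconds" || return 1
  done
}

# Example: wait_for_all_apps 900 && echo "cluster is back"
```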

Step 7: Restore Data

This is where your data backup strategy comes in:

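# Restore database (expects a custom-format dump, i.e. created with pg_dump -Fc)
pg_restore -d myapp /backups/myapp-latest.dump

# Or Velero restore (Velero must already be installed on the new cluster and pointed at the backup storage)
velero restore create --from-backup production-daily-latest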

Step 8: Update DNS / Load Balancers

Point external traffic to the new cluster:

  • Update DNS records
  • Reconfigure CDN origins
  • Update load balancer targets
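
Before declaring the cutover done, it helps to confirm that the name actually resolves to the new cluster. A minimal sketch using dig; dns_points_to is a hypothetical helper, and the hostname and IP in the example are placeholders:

```shell
# Sketch: succeed only if <host_name> currently resolves to <expected_ip>.
# Note this asks your local resolver, so other clients may still see the
# old record until its TTL expires.
dns_points_to() {
  host_name=$1
  expected_ip=$2
  dig +short "$host_name" A | grep -Fxq "$expected_ip"
}

# Example: dns_points_to app.example.com 203.0.113.10 && echo "DNS updated"
```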

Recovery Time Objective

How fast can you recover? In my experience:

Phase                          Time
Cluster provisioning           5-15 min
ArgoCD install                 2-3 min
Secrets setup                  1-2 min
GitOps sync (small cluster)    5-10 min
GitOps sync (large cluster)    15-30 min
Data restore                   Varies wildly
DNS propagation                5-60 min

Total for infrastructure: 20-60 minutes

Data restore depends on your data volume and backup strategy. This is usually the longest part.
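
To sanity-check a total like this against your own measurements, it is enough to add up the per-phase ranges. A small sketch; sum_phases is a hypothetical helper that reads one "low high" pair of minutes per line:

```shell
# Sketch: sum "low high" minute estimates (one phase per line on stdin)
# and print the resulting best-case/worst-case range.
sum_phases() {
  total_low=0
  total_high=0
  while read -r low high; do
    [ -n "$low" ] || continue
    total_low=$((total_low + low))
    total_high=$((total_high + high))
  done
  echo "${total_low}-${total_high} min"
}

# Provisioning, ArgoCD install, secrets setup, GitOps sync (small to large):
#   printf '%s\n' '5 15' '2 3' '1 2' '5 30' | sum_phases
```

With the numbers from the table this prints 13-50 min. Adding DNS propagation (5-60 min) stretches the range to 18-110 min, which is why short TTLs on the records you plan to move are worth setting up in advance.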

What Can Go Wrong

Missing Secrets

If Vault becomes inaccessible, External Secrets can’t fetch secrets. Your workloads won’t start.

Mitigation: Run Vault in HA mode on separate infrastructure. Back up Vault regularly. Consider Vault replication to a second location.
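
It is worth checking Vault before you start the sync, not after workloads fail to start. A sketch that probes Vault's sys/health endpoint with curl; vault_ready is a hypothetical helper, and treating a standby node as good enough is an assumption you may want to tighten:

```shell
# Sketch: probe Vault's health endpoint. 200 = active, 429 = unsealed standby;
# anything else (sealed, uninitialized, unreachable) is treated as not ready.
vault_ready() {
  vault_addr=$1
  code=$(curl -s -o /dev/null -w '%{http_code}' "${vault_addr}/v1/sys/health") || code=000
  case "$code" in
    200|429) return 0 ;;
    *) echo "Vault not ready at ${vault_addr} (HTTP ${code})" >&2; return 1 ;;
  esac
}

# Example: vault_ready https://vault.internal.company.com || exit 1
```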

External Dependencies

Your cluster might depend on things not in Git:

  • Cloud IAM roles
  • DNS zones
  • TLS certificates from Let’s Encrypt (re-issuance is subject to rate limits)
  • External databases

Mitigation: Document all external dependencies. Better: manage them with Terraform/Crossplane in the same Git repo.

Circular Dependencies

ArgoCD needs to be running before it can deploy anything, including itself. This is the bootstrap problem.

Mitigation: Keep bootstrap scripts that work without ArgoCD. Once ArgoCD is running, it can manage itself.
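
Such a bootstrap script can be very small. A sketch that needs nothing but kubectl and repeats the install and root-app steps from the walkthrough; bootstrap_gitops is a hypothetical name, and both arguments are whatever paths or URLs your repo uses:

```shell
# Sketch of a bootstrap that needs only kubectl. Arguments:
#   $1 = ArgoCD install manifest (URL or file)
#   $2 = root Application manifest
bootstrap_gitops() {
  install_manifest=$1
  root_app=$2
  # Idempotent namespace creation: safe to re-run after a partial failure.
  kubectl create namespace argocd --dry-run=client -o yaml | kubectl apply -f - || return 1
  kubectl apply -n argocd -f "$install_manifest" || return 1
  kubectl wait --for=condition=Ready pods --all -n argocd --timeout=300s || return 1
  kubectl apply -f "$root_app"
}

# Example:
#   bootstrap_gitops \
#     https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml \
#     apps/root.yaml
```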

Data Loss

GitOps recreates infrastructure, not data. If your database backup is stale, you lose data.

Mitigation: Separate, tested data backup strategy. Regular restore tests.
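
Staleness is easy to check mechanically. A minimal sketch that fails when the latest dump is older than a threshold; backup_is_fresh is a hypothetical helper, and the path and the 24-hour limit in the example are placeholders for your own values:

```shell
# Sketch: succeed only if the backup file exists and was modified within the
# last <max_age_minutes>. Uses find -mmin (GNU and BSD find, not strict POSIX).
backup_is_fresh() {
  backup_file=$1
  max_age_minutes=$2
  [ -f "$backup_file" ] || return 1
  [ -n "$(find "$backup_file" -mmin "-${max_age_minutes}")" ]
}

# Example: backup_is_fresh /backups/myapp-latest.dump 1440 || echo "backup is stale"
```

A fresh file only proves the backup job ran; it says nothing about whether the dump restores, so this complements restore tests rather than replacing them.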

Making It Better: The Recovery Manifest

Create a single file that documents everything needed for recovery:

# recovery-manifest.yaml
recovery:
  cluster:
    provisioner: terraform
    path: terraform/clusters/production

  secrets:
    method: external-secrets
    vault-address: https://vault.internal.company.com
    vault-backup: s3://backups/vault-snapshots/

  gitops:
    repo: https://github.com/yourorg/gitops-repo.git
    path: clusters/production
    root-app: apps/root.yaml

  data:
    databases:
      - name: postgres-main
        backup-location: s3://backups/postgres/
        restore-command: |
          pg_restore -d main /backups/latest.dump
    volumes:
      - name: uploads
        backup-location: s3://backups/uploads/

  external-dependencies:
    - DNS zone: managed in Route53, terraform/dns/
    - Load balancer: created by terraform/clusters/
    - Vault: vault.internal.company.com (HA, separate infra)

  contacts:
    - oncall: ops-team@company.com
    - escalation: platform-lead@company.com

Store this in Git. Review it quarterly. Test it at least annually, or quarterly if you want to meet the 90-day target in the checklist below.
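
A quarterly review is easier if something flags a manifest that has lost a section. A sketch that checks for the top-level sections used in the example above; check_recovery_manifest is a hypothetical helper and only tests for presence, not for correct values:

```shell
# Sketch: verify that a recovery manifest still has every section under
# "recovery:". Plain grep keeps it dependency-free; it assumes the two-space
# indentation used in the example manifest.
check_recovery_manifest() {
  manifest=$1
  missing=0
  for section in cluster secrets gitops data external-dependencies contacts; do
    if ! grep -Eq "^  ${section}:" "$manifest"; then
      echo "missing section: ${section}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: check_recovery_manifest recovery-manifest.yaml || exit 1
```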

The Mindset Shift

GitOps DR requires thinking differently:

  1. Clusters are cattle, not pets. If recovery is easy, you worry less about protecting individual clusters.

  2. State belongs in Git. Every time you kubectl apply without committing to Git, you’re creating DR debt.

  3. Data is separate. GitOps handles infrastructure. Data needs its own strategy.

  4. Practice recovery. The first time you recover shouldn’t be during an actual disaster.
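
The DR debt from point 2 can be measured: anything live that differs from Git shows up as a diff. A sketch that lists drifted apps, relying on argocd app diff exiting with 1 when it finds a difference; report_drift is a hypothetical helper:

```shell
# Sketch: print every ArgoCD app whose live state differs from Git.
# Returns non-zero if any app has drifted or could not be diffed.
report_drift() {
  drifted=0
  for app in $(argocd app list -o name); do
    if argocd app diff "$app" >/dev/null 2>&1; then
      status=0
    else
      status=$?
    fi
    if [ "$status" -eq 1 ]; then
      echo "DRIFT: $app"
      drifted=1
    elif [ "$status" -ne 0 ]; then
      echo "ERROR: $app (exit $status)"
      drifted=1
    fi
  done
  return "$drifted"
}

# Example: report_drift || echo "live state differs from Git"
```

Run on a schedule, this turns "someone applied something by hand" from a surprise during recovery into a routine alert.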

My Recovery Checklist

## Pre-Disaster (do now)
- [ ] All manifests in Git
- [ ] Secrets strategy documented and tested
- [ ] Vault backed up and HA configured
- [ ] Recovery playbook written
- [ ] External dependencies documented
- [ ] Data backup strategy implemented
- [ ] Recovery tested in last 90 days

## During Recovery
- [ ] Provision new cluster
- [ ] Install ArgoCD
- [ ] Restore secrets decryption capability
- [ ] Connect ArgoCD to Git
- [ ] Apply root application
- [ ] Monitor sync progress
- [ ] Restore data from backups
- [ ] Update DNS/load balancers
- [ ] Verify application health
- [ ] Notify stakeholders

The best disaster recovery is the one you never need. The second best is the one you’ve practiced so many times that when disaster strikes, it’s just another Tuesday.