Your cluster is gone. Complete failure. The cloud region is down, the hardware died, or someone ran the wrong `terraform destroy`. Everything is gone.
Now what?
If you’ve been doing GitOps right, the answer is: spin up a new cluster, point ArgoCD at Git, wait. Your entire infrastructure recreates itself.
This is the ultimate promise of GitOps: Git is your backup.
## Why GitOps Changes Disaster Recovery
Traditional DR involves:
- Regular backups of cluster state
- Backup storage (etcd snapshots, Velero backups)
- Tested restore procedures
- Recovery time measured in hours
GitOps DR is different:
- Cluster state IS the backup (it’s in Git)
- No separate backup infrastructure needed
- Recovery = bootstrap + sync
- Recovery time measured in minutes
The difference comes from treating cluster state as derived, not primary. The primary data is in Git. Clusters are disposable.
```mermaid
flowchart TD
    subgraph traditional["Traditional"]
        TC["Cluster<br/>(primary)"] -->|backup| TB["Backup<br/>(secondary)"]
    end
    subgraph gitops["GitOps"]
        GG["Git<br/>(primary)"] -->|sync| GC["Cluster<br/>(derived)"]
    end
```
## What You Need for GitOps DR
For this to work, everything must be in Git:
### 1. Application Manifests
All your workload definitions:
- Deployments, Services, ConfigMaps
- Helm charts with values files
- Kustomize overlays
This is the easy part. Most teams already have this.
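If you use Kustomize, each environment typically gets its own overlay in Git. A minimal sketch (the paths, patch file, and image name are illustrative):

```yaml
# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base          # shared manifests for all environments
patches:
  - path: replica-count.yaml  # production-specific replica override
images:
  - name: myapp
    newTag: v1.4.2      # pin the exact tag deployed in production
```

Because the overlay pins concrete versions, a fresh cluster syncs to exactly the state that was running before the failure.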
### 2. Infrastructure Configuration
Cluster-level resources:
- Namespaces
- RBAC (Roles, RoleBindings, ServiceAccounts)
- NetworkPolicies
- ResourceQuotas
- Pod Security Standards (PodSecurityPolicies, if you're on a cluster old enough to still have them — they were removed in Kubernetes 1.25)
Often forgotten, but critical for security and isolation.
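As a sketch, a single per-team file in Git can bundle the namespace, its access, and its limits so they all come back together on recovery (all names are illustrative):

```yaml
# cluster/namespaces/team-a.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
---
# Grant the team edit rights inside its own namespace only
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-developers
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-developers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io
---
# Cap what the namespace can request
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
```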
### 3. Platform Components
The tools your applications depend on:
- Ingress controllers
- Cert-manager
- Monitoring stack (Prometheus, Grafana)
- Logging (Loki, Fluentd)
- Service mesh (if used)
### 4. Secrets Strategy
Secrets are special — you can’t store plain secrets in Git. Options:
- **Sealed Secrets**: Encrypted in Git, decrypted in cluster
- **External Secrets**: Fetch from Vault/AWS Secrets Manager
- **SOPS**: Encrypt files with keys you control
The important thing: secrets must be reproducible, not stored only in the cluster.
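With the External Secrets approach, for example, the manifest you commit is only a pointer; the actual value lives in Vault. A sketch using the External Secrets Operator, assuming a store named `vault-backend` and an illustrative Vault path:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: myapp-db-credentials
  namespace: myapp
spec:
  refreshInterval: 1h          # re-fetch from Vault hourly
  secretStoreRef:
    name: vault-backend        # assumed ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: myapp-db-credentials # Kubernetes Secret to create
  data:
    - secretKey: password
      remoteRef:
        key: secret/data/myapp/db   # illustrative Vault path
        property: password
```

This manifest is safe to store in Git: on a fresh cluster, the operator recreates the Secret as soon as it can reach Vault.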
### 5. Persistent Data
Here’s the hard truth: GitOps doesn’t back up data.
Databases, file storage, stateful workloads — these need separate backup strategies. GitOps recreates the infrastructure that runs your data, not the data itself.
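One useful pattern: let the cluster run its own database backups, so the backup *job* is itself recreated by GitOps even though the data is not. A sketch, assuming a container image that ships both `pg_dump` and the `aws` CLI (the stock `postgres` image does not), and illustrative names throughout:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: myapp
spec:
  schedule: "0 2 * * *"        # nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: backup-tools:latest   # assumed: has pg_dump + aws CLI
              command: ["/bin/sh", "-c"]
              args:
                - pg_dump -Fc "$DATABASE_URL" | aws s3 cp - "s3://backups/postgres/myapp-$(date +%F).dump"
              envFrom:
                - secretRef:
                    name: myapp-db-credentials  # provides DATABASE_URL
```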
## The Recovery Process
Let’s walk through recovering a complete cluster failure.
### Step 0: Have a Recovery Playbook
Before disaster strikes, document:
- How to provision a new cluster
- Bootstrap commands for ArgoCD
- Vault access and backup location
- External dependencies (DNS, load balancers)
Test this regularly. A playbook you’ve never run is fiction.
### Step 1: Provision New Cluster
Using whatever method you prefer:
```bash
# Terraform
cd terraform/clusters/production
terraform apply

# Or managed Kubernetes
gcloud container clusters create production-recovery \
  --region europe-west1 \
  --num-nodes 3

# Or bare metal
talosctl apply-config --nodes 10.0.0.10 --file controlplane.yaml
```
You need a functional Kubernetes cluster. How you get there is separate from GitOps.
### Step 2: Install ArgoCD
```bash
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Wait for it to be ready
kubectl wait --for=condition=Ready pods --all -n argocd --timeout=300s
```
### Step 3: Restore Secrets Access
Configure External Secrets to connect to Vault:
```bash
# Configure Vault authentication for the new cluster
kubectl create secret generic vault-token \
  -n external-secrets \
  --from-literal=token=$VAULT_TOKEN
```
### Step 4: Connect ArgoCD to Git
```bash
# Add your GitOps repository
argocd repo add https://github.com/yourorg/gitops-repo.git \
  --username git \
  --password $GIT_TOKEN

# Or via Secret
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: gitops-repo
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  url: https://github.com/yourorg/gitops-repo.git
  username: git
  password: ${GIT_TOKEN}
EOF
```
### Step 5: Apply the Root Application
If you use the App-of-Apps pattern (you should):
```bash
kubectl apply -f - <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/yourorg/gitops-repo.git
    targetRevision: main
    path: clusters/production
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
EOF
```
### Step 6: Wait
ArgoCD now reconciles everything:
- Discovers all applications in the root app
- Creates namespaces and RBAC
- Deploys platform components
- Deploys your applications
Watch progress:
```bash
argocd app list
argocd app get root --refresh
```

Or in the UI: `kubectl port-forward svc/argocd-server -n argocd 8080:443`
### Step 7: Restore Data
This is where your data backup strategy comes in:
```bash
# Restore database
pg_restore -d myapp < /backups/myapp-latest.dump

# Or Velero restore
velero restore create --from-backup production-daily-latest
```
### Step 8: Update DNS / Load Balancers
Point external traffic to the new cluster:
- Update DNS records
- Reconfigure CDN origins
- Update load balancer targets
## Recovery Time Objective
How fast can you recover? In my experience:
| Phase | Time |
|---|---|
| Cluster provisioning | 5-15 min |
| ArgoCD install | 2-3 min |
| Secrets setup | 1-2 min |
| GitOps sync (small cluster) | 5-10 min |
| GitOps sync (large cluster) | 15-30 min |
| Data restore | Varies wildly |
| DNS propagation | 5-60 min |
**Total for infrastructure: 20-60 minutes**
Data restore depends on your data volume and backup strategy. This is usually the longest part.
## What Can Go Wrong
### Missing Secrets
If Vault becomes inaccessible, External Secrets can’t fetch secrets. Your workloads won’t start.
**Mitigation**: Run Vault in HA mode on separate infrastructure. Back up Vault regularly. Consider replicating Vault to a second location.
### External Dependencies
Your cluster might depend on things not in Git:
- Cloud IAM roles
- DNS zones
- SSL certificates from Let’s Encrypt
- External databases
**Mitigation**: Document all external dependencies. Better: manage them with Terraform/Crossplane in the same Git repo.
### Circular Dependencies
ArgoCD needs to run to deploy ArgoCD. The bootstrap problem.
**Mitigation**: Keep bootstrap scripts that work without ArgoCD. Once ArgoCD is running, it can manage itself.
### Data Loss
GitOps recreates infrastructure, not data. If your database backup is stale, you lose data.
**Mitigation**: Separate, tested data backup strategy. Regular restore tests.
## Making It Better: The Recovery Manifest
Create a single file that documents everything needed for recovery:
```yaml
# recovery-manifest.yaml
recovery:
  cluster:
    provisioner: terraform
    path: terraform/clusters/production
  secrets:
    method: external-secrets
    vault-address: https://vault.internal.company.com
    vault-backup: s3://backups/vault-snapshots/
  gitops:
    repo: https://github.com/yourorg/gitops-repo.git
    path: clusters/production
    root-app: apps/root.yaml
  data:
    databases:
      - name: postgres-main
        backup-location: s3://backups/postgres/
        restore-command: |
          pg_restore -d main /backups/latest.dump
    volumes:
      - name: uploads
        backup-location: s3://backups/uploads/
  external-dependencies:
    - "DNS zone: managed in Route53, terraform/dns/"
    - "Load balancer: created by terraform/clusters/"
    - "Vault: vault.internal.company.com (HA, separate infra)"
  contacts:
    - oncall: ops-team@company.com
    - escalation: platform-lead@company.com
```
Store this in Git. Review it quarterly. Test it annually (at minimum).
## The Mindset Shift
GitOps DR requires thinking differently:
**Clusters are cattle, not pets.** If recovery is easy, you worry less about protecting individual clusters.

**State belongs in Git.** Every time you `kubectl apply` without committing to Git, you're creating DR debt.

**Data is separate.** GitOps handles infrastructure. Data needs its own strategy.

**Practice recovery.** The first time you recover shouldn't be during an actual disaster.
## My Recovery Checklist
### Pre-Disaster (do now)
- [ ] All manifests in Git
- [ ] Secrets strategy documented and tested
- [ ] Vault backed up and HA configured
- [ ] Recovery playbook written
- [ ] External dependencies documented
- [ ] Data backup strategy implemented
- [ ] Recovery tested in last 90 days
### During Recovery
- [ ] Provision new cluster
- [ ] Install ArgoCD
- [ ] Restore secrets decryption capability
- [ ] Connect ArgoCD to Git
- [ ] Apply root application
- [ ] Monitor sync progress
- [ ] Restore data from backups
- [ ] Update DNS/load balancers
- [ ] Verify application health
- [ ] Notify stakeholders
The best disaster recovery is the one you never need. The second best is the one you’ve practiced so many times that when disaster strikes, it’s just another Tuesday.
