GitOps promises that Git is the source of truth. But what if someone kubectl edits a deployment? What if a mutating webhook changes a resource? What if the cluster silently diverges from what Git says it should be?
This is configuration drift, and it’s one of the most insidious problems in Kubernetes operations. ArgoCD can help you detect it — if you configure it correctly.
What Is Configuration Drift?
Drift happens when the actual state of your cluster differs from the desired state in Git.
flowchart LR
subgraph git["Git says (Source of truth)"]
G1["replicas: 3"]
G2["image: v1.2.3"]
G3["cpu: 100m"]
end
subgraph cluster["Cluster has (Actual state)"]
C1["replicas: 5"]
C2["image: v1.2.3"]
C3["cpu: 200m"]
end
git -.->|"≠"| cluster
How did replicas become 5 when Git says 3? Possible causes:
- Manual changes: Someone ran kubectl scale or kubectl edit
- Horizontal Pod Autoscaler: HPA adjusted replicas
- Mutating webhooks: Admission controllers modified resources
- Controller side effects: Operators made changes
- Partial syncs: Sync failed midway
Some drift is intentional (HPA). Most is not. The problem is not knowing which is which.
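The comparison in the diagram above can be sketched in a few lines of Python. This is only an illustration of the idea, not how ArgoCD implements its diff; the `find_drift` helper and the flattened manifest dictionaries are made up for this example.

```python
from typing import Any, Dict, Tuple


def find_drift(desired: Dict[str, Any], actual: Dict[str, Any], path: str = "") -> Dict[str, Tuple[Any, Any]]:
    """Return {field_path: (desired_value, actual_value)} for every differing field.

    A key missing on one side is treated as None on that side.
    """
    drift: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(desired) | set(actual)):
        here = f"{path}/{key}"
        want, have = desired.get(key), actual.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            # Recurse into nested objects so the path points at the exact field.
            drift.update(find_drift(want, have, here))
        elif want != have:
            drift[here] = (want, have)
    return drift


# Hypothetical, heavily simplified manifests matching the diagram.
git_state = {"replicas": 3, "image": "v1.2.3", "resources": {"cpu": "100m"}}
cluster_state = {"replicas": 5, "image": "v1.2.3", "resources": {"cpu": "200m"}}

for field, (want, have) in find_drift(git_state, cluster_state).items():
    print(f"{field}: Git={want} cluster={have}")
```

Running it reports `/replicas` and `/resources/cpu` as drifted and stays silent about `image`, which matches on both sides.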
Why Drift Matters
Without drift detection, you have no guarantee that Git represents reality. This breaks:
- Audit trails: “What’s deployed?” becomes “check the cluster” instead of “check Git”
- Disaster recovery: Rebuilding from Git won’t match the old state
- Security: Unauthorized changes go unnoticed
- Reproducibility: Two clusters from the same Git won’t be identical
The moment you have undetected drift, you’ve lost the core benefit of GitOps.
ArgoCD’s Sync Status
ArgoCD continuously compares Git to cluster state. The sync status tells you:
- Synced: Cluster matches Git exactly
- OutOfSync: Differences detected
- Unknown: ArgoCD can’t determine state
| Application | Sync Status | Health |
|---|---|---|
| frontend | Synced | Healthy |
| backend | OutOfSync | Healthy |
| database | Synced | Healthy |
| cache | Synced | Degraded |
OutOfSync means the live state no longer matches Git. That can be drift in the cluster, or simply a new commit that has not been synced yet. Either way, ArgoCD’s default behavior might surprise you.
Self-Heal: Automatic Drift Correction
ArgoCD can automatically revert drift:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: frontend
spec:
  syncPolicy:
    automated:
      selfHeal: true # Revert manual changes
      prune: true # Delete orphaned resources
With selfHeal: true, when someone runs kubectl scale deployment frontend --replicas=5, ArgoCD will revert it to what Git says within seconds.
This is powerful but has implications:
- Intentional changes get reverted
- HPA adjustments get overwritten
- You can’t quickly hotfix production
For most applications, selfHeal should be enabled. It’s the “GitOps purist” approach.
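To make the trade-off concrete, here is a toy model of a single reconciliation pass. It is a sketch under my own assumptions, not ArgoCD's controller logic: `SyncPolicy` and `reconcile` are hypothetical names, and real manifests are far richer than these dictionaries.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SyncPolicy:
    automated: bool = True
    self_heal: bool = False


def reconcile(desired: dict, live: dict, policy: SyncPolicy) -> Tuple[dict, str]:
    """Return (new_live_state, action) for one reconciliation pass."""
    if live == desired:
        return live, "none (Synced)"
    if policy.automated and policy.self_heal:
        # Self-heal: overwrite the live state with what Git declares.
        return dict(desired), "reverted to Git"
    # Without self-heal the drift is only reported; the cluster is left alone.
    return live, "reported OutOfSync"


desired = {"replicas": 3}
drifted = {"replicas": 5}  # someone ran kubectl scale

print(reconcile(desired, drifted, SyncPolicy(self_heal=True)))
print(reconcile(desired, drifted, SyncPolicy(self_heal=False)))
```

The first call puts replicas back to 3. The second leaves the cluster at 5 and only flags the mismatch.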
Handling Intentional Drift: Ignore Differences
Some fields should be managed outside Git:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: autoscaled-app
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas
    - group: ""
      kind: Service
      jsonPointers:
        - /spec/clusterIP
This tells ArgoCD: “Don’t report drift for these fields.”
Common fields to ignore:
- /spec/replicas (if using HPA)
- /spec/clusterIP (assigned by Kubernetes)
- /metadata/annotations (controller-added)
- /status (always managed by controllers)
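Conceptually, ignoring a JSON pointer means dropping that field from both sides before comparing. The sketch below shows the idea with two hypothetical helpers, `remove_pointer` and `in_sync`. It handles only object paths (no array indices) and is not ArgoCD's actual diffing code.

```python
import copy
from typing import List


def remove_pointer(doc: dict, pointer: str) -> None:
    """Delete the field addressed by an RFC 6901 JSON pointer, if it exists."""
    # "~1" and "~0" are the RFC 6901 escapes for "/" and "~" inside a key.
    tokens = [t.replace("~1", "/").replace("~0", "~") for t in pointer.lstrip("/").split("/")]
    node = doc
    for token in tokens[:-1]:
        node = node.get(token)
        if not isinstance(node, dict):
            return  # Path does not exist, so there is nothing to ignore.
    node.pop(tokens[-1], None)


def in_sync(desired: dict, live: dict, ignored_pointers: List[str]) -> bool:
    """Compare two manifests after dropping the ignored fields from both."""
    desired, live = copy.deepcopy(desired), copy.deepcopy(live)
    for pointer in ignored_pointers:
        remove_pointer(desired, pointer)
        remove_pointer(live, pointer)
    return desired == live


# Hypothetical manifests: Git says 3 replicas, the HPA scaled the cluster to 5.
git_manifest = {"spec": {"replicas": 3, "template": {"image": "v1.2.3"}}}
live_manifest = {"spec": {"replicas": 5, "template": {"image": "v1.2.3"}}}

print(in_sync(git_manifest, live_manifest, []))                  # False
print(in_sync(git_manifest, live_manifest, ["/spec/replicas"]))  # True
```

Note that the ignored field vanishes from the comparison entirely: a malicious or accidental change to it would go unreported too, so keep the list short.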
Detecting Drift Without Auto-Fix
Sometimes you want to know about drift but not automatically fix it:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: critical-app
spec:
  syncPolicy:
    automated:
      selfHeal: false # Don't auto-fix
      prune: false # Don't auto-delete
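Now, when the cluster drifts, ArgoCD shows OutOfSync status but waits for manual intervention. Because the automated block is still present, new commits in Git continue to sync automatically; remove automated entirely if every sync needs explicit approval. This is useful for: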
- Critical production systems where you want human review
- Debugging drift sources
- Applications managed partially outside GitOps
Notifications: Alert on Drift
Don’t stare at the dashboard. Get notified:
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  # The trigger is named after the condition it checks (OutOfSync, not Unknown)
  trigger.on-out-of-sync: |
    - when: app.status.sync.status == 'OutOfSync'
      send: [app-out-of-sync]
  template.app-out-of-sync: |
    message: |
      Application {{.app.metadata.name}} is OutOfSync.
      Sync Status: {{.app.status.sync.status}}
      Health: {{.app.status.health.status}}
      Repository: {{.app.spec.source.repoURL}}
Connect this to Slack, PagerDuty, or email. Drift should trigger alerts.
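One design point worth spelling out: you usually want an alert when an app enters OutOfSync, not on every check while it stays there. The sketch below models that behavior and renders the same message as the template above. `should_notify`, `render_message`, and the sample application are hypothetical, and this is not how the notifications controller is implemented.

```python
from typing import Optional


def should_notify(previous: Optional[str], current: str) -> bool:
    """Fire only on the transition into OutOfSync."""
    return current == "OutOfSync" and previous != "OutOfSync"


def render_message(app: dict) -> str:
    """Plain-Python equivalent of the message template."""
    status = app["status"]
    return "\n".join([
        f"Application {app['metadata']['name']} is OutOfSync.",
        f"Sync Status: {status['sync']['status']}",
        f"Health: {status['health']['status']}",
        f"Repository: {app['spec']['source']['repoURL']}",
    ])


# Hypothetical application object; the repository URL is a placeholder.
app = {
    "metadata": {"name": "backend"},
    "spec": {"source": {"repoURL": "https://example.com/org/repo.git"}},
    "status": {"sync": {"status": "OutOfSync"}, "health": {"status": "Healthy"}},
}

observed = ["Synced", "OutOfSync", "OutOfSync", "Synced", "OutOfSync"]
previous = None
alerts = []
for check, status in enumerate(observed):
    if should_notify(previous, status):
        alerts.append(check)
    previous = status

print(alerts)  # [1, 4]
print(render_message(app))
```

Five checks produce two alerts, one per drift incident, which keeps the channel readable.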
The Diff View: Understanding Drift
When drift occurs, ArgoCD shows exactly what changed:
argocd app diff frontend
Or in the UI, click on an OutOfSync application to see the diff:
--- Git (desired)
+++ Cluster (actual)
@@ -1,4 +1,4 @@
spec:
- replicas: 3
+ replicas: 5
template:
spec:
This is invaluable for understanding what drifted and why.
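If you want the same kind of diff outside ArgoCD, for example in a script that compares a rendered manifest against an exported live one, Python's standard `difflib` produces this format. The `manifest_diff` helper and the two manifest snippets are illustrative only.

```python
import difflib


def manifest_diff(desired_text: str, live_text: str) -> str:
    """Return a unified diff between the Git manifest and the live manifest."""
    lines = difflib.unified_diff(
        desired_text.splitlines(keepends=True),
        live_text.splitlines(keepends=True),
        fromfile="Git (desired)",
        tofile="Cluster (actual)",
    )
    return "".join(lines)


desired_text = "spec:\n  replicas: 3\n  template:\n    spec:\n"
live_text = "spec:\n  replicas: 5\n  template:\n    spec:\n"

print(manifest_diff(desired_text, live_text))
```

An empty string means the two texts are identical. Live manifests carry server-added fields, so in practice you would strip those before comparing.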
Refresh vs Sync
Two different operations:
Refresh: Compare Git to cluster, update status. No changes made.
argocd app get frontend --refresh
Sync: Apply Git state to cluster. Changes made.
argocd app sync frontend
Refresh is read-only and runs frequently (every 3 minutes by default). Sync writes to the cluster and should be deliberate (unless automated).
Drift Detection Strategy
Here’s my approach:
For Development/Staging
- selfHeal: true (revert all drift)
- prune: true (delete orphaned resources)
- Fast feedback, pure GitOps
For Production (Most Apps)
- selfHeal: true (revert drift)
- prune: true (delete orphaned resources)
- Alerts on any OutOfSync event
- Investigate why drift happened
For Production (Critical/Special)
- selfHeal: false (human review required)
- prune: false (manual deletion only)
- Strict alerts
- Explicit sync approval
For HPA-Managed Apps
ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
      - /spec/replicas
syncPolicy:
  automated:
    selfHeal: true # For other fields
Finding the Drift Source
When you see drift, investigate:
Check audit logs: Who ran kubectl? Kubernetes events rarely name the user who made a change; the API server audit log does, if it is enabled.
kubectl get events --field-selector reason=Update
Check controller logs: Did an operator make changes?
Check admission webhooks: Are mutations happening?
kubectl get mutatingwebhookconfigurations
Check the diff: What exactly changed?
argocd app diff app-name
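Another clue lives on the resource itself: `metadata.managedFields` records which field manager wrote to the object and when. The sketch below summarizes those entries from a resource exported as JSON (for example with `kubectl get deployment frontend -o json --show-managed-fields`). The `field_managers` helper and the sample data, including the manager names, are illustrative assumptions.

```python
from typing import List, Tuple


def field_managers(resource: dict) -> List[Tuple[str, str, str]]:
    """Return (time, manager, operation) for each managedFields entry, newest first."""
    entries = resource.get("metadata", {}).get("managedFields", [])
    rows = [
        (entry.get("time", ""), entry.get("manager", "unknown"), entry.get("operation", ""))
        for entry in entries
    ]
    # Timestamps are RFC 3339 strings in UTC, so string order is time order.
    return sorted(rows, reverse=True)


# Hypothetical Deployment, trimmed to the fields this sketch reads.
deployment = {
    "metadata": {
        "name": "frontend",
        "managedFields": [
            {"manager": "argocd-controller", "operation": "Apply", "time": "2024-05-01T09:00:00Z"},
            {"manager": "kubectl-edit", "operation": "Update", "time": "2024-05-02T14:30:00Z"},
        ],
    }
}

for time, manager, operation in field_managers(deployment):
    print(f"{time}  {manager:<20} {operation}")
```

A manager that is not your GitOps controller at the top of the list is a strong hint about where the drift came from, though it names a tool, not a person.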
Preventing Drift at the Source
Better than detecting drift is preventing it:
RBAC restrictions: Limit who can modify resources
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: readonly
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["get", "list", "watch"] # No create/update/delete
Policy enforcement: Use Kyverno to block manual changes
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-gitops
spec:
  validationFailureAction: Enforce # Without this, Kyverno only audits and does not block
  rules:
    - name: block-manual-changes
      match:
        resources:
          kinds:
            - Deployment
      exclude:
        subjects:
          - kind: ServiceAccount
            name: argocd-application-controller
            namespace: argocd # Assumes ArgoCD runs in the argocd namespace
      validate:
        message: "Changes must go through GitOps"
        deny: {}
Training: Teach teams to change Git, not the cluster
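The intent of that policy rule boils down to a simple decision: writes to Deployments are allowed only when they come from ArgoCD's service account. Here is that decision as a toy function. It is not Kyverno's engine; `admit` is a hypothetical name, and the `argocd` namespace in the username is an assumption. Kubernetes identifies service accounts as `system:serviceaccount:<namespace>:<name>`.

```python
from typing import Tuple

# Assumes ArgoCD is installed in the "argocd" namespace.
ARGOCD_USER = "system:serviceaccount:argocd:argocd-application-controller"
WRITE_OPERATIONS = {"CREATE", "UPDATE", "DELETE"}


def admit(username: str, operation: str, kind: str) -> Tuple[bool, str]:
    """Return (allowed, message) for one admission request."""
    if kind != "Deployment" or operation not in WRITE_OPERATIONS:
        return True, ""  # The rule only covers writes to Deployments.
    if username == ARGOCD_USER:
        return True, ""  # The GitOps controller is exempt.
    return False, "Changes must go through GitOps"


print(admit("alice@example.com", "UPDATE", "Deployment"))
print(admit(ARGOCD_USER, "UPDATE", "Deployment"))
```

A rule this strict also blocks emergency hotfixes, so decide up front who holds the break-glass exemption.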
Monitoring Drift Over Time
Track drift as a metric:
# Prometheus query for out-of-sync apps
count(argocd_app_info{sync_status="OutOfSync"})
Alert if it’s non-zero for too long:
- alert: GitOpsDriftDetected
  expr: count(argocd_app_info{sync_status="OutOfSync"}) > 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "GitOps drift detected"
    description: "One or more applications are OutOfSync with Git"
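If you don't run Prometheus, the same signal can be pulled out of the raw metrics text with a short script. This sketch assumes you have already fetched the text from the application controller's metrics endpoint. `out_of_sync_apps` is a hypothetical helper, and the sample lines are abbreviated: the real `argocd_app_info` metric carries more labels.

```python
import re
from typing import List

# Matches label pairs such as name="frontend" inside the braces.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def out_of_sync_apps(metrics_text: str) -> List[str]:
    """Return the names of applications whose sync_status label is OutOfSync."""
    apps = []
    for line in metrics_text.splitlines():
        if not line.startswith("argocd_app_info{"):
            continue  # Skip comments and unrelated metrics.
        labels = dict(LABEL_RE.findall(line))
        if labels.get("sync_status") == "OutOfSync":
            apps.append(labels.get("name", "?"))
    return apps


# Abbreviated sample in Prometheus text format, mirroring the status table earlier.
sample = """\
# TYPE argocd_app_info gauge
argocd_app_info{name="frontend",sync_status="Synced",health_status="Healthy"} 1
argocd_app_info{name="backend",sync_status="OutOfSync",health_status="Healthy"} 1
argocd_app_info{name="database",sync_status="Synced",health_status="Healthy"} 1
argocd_app_info{name="cache",sync_status="Synced",health_status="Degraded"} 1
"""

print(out_of_sync_apps(sample))  # ['backend']
```

Unlike the bare count in the alert, this tells you which applications drifted, which is what you want in the alert description.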
My Checklist for Drift-Free GitOps
- [ ] selfHeal enabled for most applications
- [ ] ignoreDifferences configured for HPA-managed replicas
- [ ] Notifications set up for OutOfSync events
- [ ] RBAC restricts direct cluster modifications
- [ ] Policy enforcement prevents manual changes
- [ ] Monitoring alerts on drift
- [ ] Team trained on GitOps workflow
Configuration drift is the enemy of reliable infrastructure. Detect it immediately, fix it automatically where safe, and investigate ruthlessly when it happens. Git should always reflect reality — that’s the whole point.
