Every platform team eventually asks: should we build an Internal Developer Platform?
The answer is probably yes. The question is how.
I’ve seen platforms that cost millions and never got adopted. I’ve also seen scrappy internal tools that transformed developer productivity overnight. The difference isn’t budget or technology — it’s approach.
What Is an Internal Developer Platform?
An Internal Developer Platform (IDP) is a self-service layer that abstracts infrastructure complexity from developers. Instead of writing Kubernetes YAML, developers describe what they need. The platform handles how.
```mermaid
flowchart TD
    subgraph before["Without Platform"]
        D1["Developer"] --> K8s["Kubernetes YAML"]
        D1 --> CI["CI Pipeline"]
        D1 --> Sec["Security Config"]
        D1 --> Mon["Monitoring Setup"]
    end
    subgraph after["With Platform"]
        D2["Developer"] --> IDP["Platform API"]
        IDP --> K8s2["Kubernetes"]
        IDP --> CI2["CI/CD"]
        IDP --> Sec2["Security"]
        IDP --> Mon2["Monitoring"]
    end
```
The platform is the interface between developer intent and infrastructure reality.
Why Platforms Fail
Built in Isolation
Platform team builds what they think developers need. Developers don’t use it. Platform team blames developers for not understanding the vision.
Too Much Too Soon
The team starts with a complete platform (service catalog, self-service everything, custom UI). Six months later, it is still not production-ready.
Wrong Abstraction Level
Platform is either too low-level (just Kubernetes with extra steps) or too high-level (no escape hatches when you need them).
No Migration Path
Existing services can’t migrate. Platform is only for greenfield. Team now maintains two systems forever.
Start With Problems, Not Solutions
Before building anything, understand:
Where do developers lose time?
- Waiting for infrastructure requests?
- Writing boilerplate configuration?
- Debugging deployment failures?
- Getting production access?
What do they keep asking for?
- Check your ticketing system
- Look at Slack questions
- Talk to people (revolutionary, I know)
Where do incidents originate?
- Misconfigured services?
- Missing security policies?
- Incorrect resource limits?
The answers reveal what your platform should automate first.
The Minimal Viable Platform
Start with three components:
1. Golden Paths
Opinionated templates that encode best practices:
```
# service-template/
├── deployment.yaml       # Standard deployment pattern
├── service.yaml          # Service with sensible defaults
├── networkpolicy.yaml    # Security by default
├── servicemonitor.yaml   # Automatic monitoring
└── values.yaml           # Customization points
```
Developers get a working setup by default. They customize only what’s different about their service.
2. Self-Service Deployment
Push code, get deployment. No tickets, no waiting.
With GitOps:
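```yaml
# Application definition in Git
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service
  namespace: argocd
spec:
  project: default # required field; "default" is Argo CD's built-in project
  source:
    repoURL: https://gitlab.internal/my-service
    path: deploy
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service
  syncPolicy:
    automated:
      prune: true    # delete resources that were removed from Git
      selfHeal: true # revert manual changes made in the cluster
```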
Developer merges to main, ArgoCD deploys. No platform UI needed initially.
3. Guardrails
Kyverno policies that prevent mistakes:
```yaml
# Require resource limits
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-limits
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "CPU and memory limits are required"
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    cpu: "?*"
                    memory: "?*"
```
Developers can’t deploy without limits. The platform enforces standards automatically.
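The same validation pattern extends to other standards. As a minimal sketch, the policy below requires a `team` label on every Pod. The policy name and the label key are assumptions chosen for illustration, and the base chart would also need to stamp that label onto the Pods it renders.

```yaml
# Sketch: require a "team" label on every Pod (label key is an assumption)
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label # hypothetical policy name
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-team-label
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "The label 'team' is required"
        pattern:
          metadata:
            labels:
              team: "?*" # any non-empty value
```

Because the rule matches Pods, Kyverno's default rule auto-generation should also apply it to controllers such as Deployments that create those Pods.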
Building Blocks
Service Templates with Helm
Create a base chart that teams extend:
```yaml
# base-service/Chart.yaml
apiVersion: v2
name: base-service
version: 1.0.0
description: Standard service template

# base-service/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Values.name }}
spec:
  replicas: {{ .Values.replicas | default 2 }}
  selector:
    matchLabels:
      app: {{ .Values.name }}
  template:
    metadata:
      labels:
        app: {{ .Values.name }}
      annotations:
        prometheus.io/scrape: "true"
    spec:
      containers:
        - name: {{ .Values.name }}
          image: {{ .Values.image }}
          ports:
            - containerPort: {{ .Values.port | default 8080 }}
          resources:
            requests:
              cpu: {{ .Values.resources.requests.cpu | default "100m" }}
              memory: {{ .Values.resources.requests.memory | default "128Mi" }}
            limits:
              cpu: {{ .Values.resources.limits.cpu | default "500m" }}
              memory: {{ .Values.resources.limits.memory | default "512Mi" }}
          readinessProbe:
            httpGet:
              path: {{ .Values.healthPath | default "/health" }}
              port: {{ .Values.port | default 8080 }}
```
Teams use it:
```yaml
# my-service/values.yaml
name: my-service
image: registry.internal/my-service:v1.2.3
replicas: 3
port: 8080
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 1
    memory: 1Gi
```
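A flat values file like this works when the base chart is installed directly with it. If a team instead wraps `base-service` as a Helm dependency of its own chart, Helm expects the subchart's values nested under the dependency name. A minimal sketch of that variant follows; the chart version `0.1.0` and the OCI repository location are assumptions, not part of the original setup.

```yaml
# my-service/Chart.yaml (wrapper chart; version and repository are hypothetical)
apiVersion: v2
name: my-service
version: 0.1.0
dependencies:
  - name: base-service
    version: 1.0.0
    repository: oci://registry.internal/charts # assumed chart location

# my-service/values.yaml
base-service: # values for a subchart live under the subchart's name
  name: my-service
  image: registry.internal/my-service:v1.2.3
  replicas: 3
  port: 8080
```

Running `helm dependency update` in the wrapper chart pulls the pinned base chart version, so upgrading the golden path becomes a one-line version bump per service.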
Namespace as Boundary
Each team gets a namespace with everything pre-configured:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-payments
  labels:
    team: payments
    environment: production
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-payments
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: team-payments
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```
Teams own their namespace. Platform provides the boundaries.
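One practical consequence of denying all egress by default is that Pods can no longer reach cluster DNS, so a pre-configured namespace usually ships an allow rule alongside the deny rule. The sketch below assumes cluster DNS runs in `kube-system` with the label `k8s-app: kube-dns`, which is common but not universal; check the labels in your own cluster. The policy name is hypothetical.

```yaml
# Sketch: allow DNS lookups from every Pod in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress # hypothetical name
  namespace: team-payments
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns # assumed DNS Pod label
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

NetworkPolicies are additive, so this rule opens only DNS while the default-deny policy continues to block everything else.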
Self-Service via GitOps
Developers modify their namespace config via pull requests:
```
infrastructure/
├── teams/
│   ├── payments/
│   │   ├── namespace.yaml
│   │   ├── applications/
│   │   │   ├── api.yaml
│   │   │   └── worker.yaml
│   │   └── secrets/
│   │       └── external-secret.yaml
│   ├── orders/
│   └── ...
```
Pull request triggers review. Merge triggers ArgoCD sync. No tickets.
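One way to wire this layout into ArgoCD without registering each team by hand is an ApplicationSet with a Git directory generator, which creates one Application per folder under `teams/`. This is a sketch under assumptions: the repository URL, the ApplicationSet name, and the `team-` naming scheme are hypothetical, and it assumes the ApplicationSet controller is installed.

```yaml
# Sketch: one Argo CD Application per team directory
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: team-config # hypothetical name
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://gitlab.internal/infrastructure # assumed repo URL
        revision: main
        directories:
          - path: teams/*
  template:
    metadata:
      name: 'team-{{path.basename}}'
    spec:
      project: default
      source:
        repoURL: https://gitlab.internal/infrastructure # assumed repo URL
        targetRevision: main
        path: '{{path}}'
        directory:
          recurse: true # include the applications/ and secrets/ subfolders
      destination:
        server: https://kubernetes.default.svc
        namespace: 'team-{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

With this in place, onboarding a new team is a pull request that adds a folder; no change to ArgoCD itself is needed.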
Progressive Enhancement
Phase 1: Templates and Guardrails
- Helm charts for common patterns
- Kyverno policies for safety
- GitOps for deployment
- Basic documentation
Outcome: Developers can deploy safely without understanding Kubernetes deeply.
Phase 2: Observability Integration
Outcome: Developers get visibility without configuration.
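A minimal sketch of how the base chart could deliver this: a ServiceMonitor template rendered for every service. It assumes the Prometheus Operator is installed, that the chart's Service carries the label `app: <name>` and exposes a port named `http`, and that metrics are served at `/metrics`. The file name and the `metricsPath` value are hypothetical.

```yaml
# base-service/templates/servicemonitor.yaml (hypothetical file in the base chart)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ .Values.name }}
  labels:
    app: {{ .Values.name }}
spec:
  selector:
    matchLabels:
      app: {{ .Values.name }} # must match the labels on the Service
  endpoints:
    - port: http # assumed name of the Service port
      path: {{ .Values.metricsPath | default "/metrics" }}
      interval: 30s
```

Because the template lives in the base chart, every service that adopts the golden path is scraped without its team writing any monitoring configuration.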
Phase 3: Developer Portal
Add a UI layer (Backstage, custom, or similar):
```mermaid
flowchart TD
    Portal["Developer Portal"] --> Catalog["Service Catalog"]
    Portal --> Templates["Create from Template"]
    Portal --> Docs["Documentation"]
    Portal --> Status["Service Status"]
    Templates --> Git["Git Repository"]
    Git --> ArgoCD["ArgoCD"]
    ArgoCD --> K8s["Kubernetes"]
```
Outcome: Developers have a single entry point.
Phase 4: Advanced Capabilities
- Secret management integration (Vault)
- Database provisioning
- Environment cloning
- Cost visibility
Outcome: True self-service for most needs.
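For the Vault integration, one common approach is the External Secrets Operator, which would fit the `secrets/external-secret.yaml` file in the repository layout shown earlier. The sketch below is an assumption about how that file could look: the store name, secret names, and Vault path are hypothetical, and the `apiVersion` depends on which version of the CRDs your installed operator serves.

```yaml
# teams/payments/secrets/external-secret.yaml (illustrative content)
apiVersion: external-secrets.io/v1beta1 # newer operator releases serve external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: payments-db # hypothetical name
  namespace: team-payments
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault # hypothetical store configured by the platform team
  target:
    name: payments-db # Kubernetes Secret the operator creates
  data:
    - secretKey: password
      remoteRef:
        key: payments/db # hypothetical Vault path
        property: password
```

The team declares which secret it needs in Git, and the secret value itself never enters the repository.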
What Not to Build
Custom YAML DSL
Don’t invent a configuration language. Use existing tools (Helm, Kustomize, cdk8s).
Ticket-Based Workflows
If developers still raise tickets, you haven’t automated enough.
One-Size-Fits-All
Provide golden paths, but allow deviation. Expert teams should be able to go lower-level.
Feature Factory
The platform should be stable. Continuous feature churn usually means you’re solving the wrong problems.
Measuring Success
Time to Production
How long from “new service idea” to “running in production”?
- Weeks → Platform not working
- Days → Getting there
- Hours → Success
Ticket Volume
Platform requests should decrease over time:
Before: 50 infrastructure tickets/week
After: 10 infrastructure tickets/week (edge cases only)
Developer NPS
Ask developers: “Would you recommend this platform to a colleague?”
Incident Correlation
Do platform-deployed services have fewer incidents than manually configured ones?
Team Structure
Platform Team Size
Rule of thumb: 1 platform engineer per 10-15 application developers.
- Too small → Platform doesn’t evolve
- Too large → Platform team builds features nobody needs
Responsibilities
Platform team:
- Maintains templates and tools
- Writes policies
- Handles platform incidents
- Provides migration support
Application teams:
- Own their services
- Deploy their code
- Define their resource needs
- First responders for service incidents
Embedded vs Centralized
Start centralized. As the platform matures, embed platform engineers in product teams part-time so they see real needs firsthand.
Real-World Example
My homelab platform uses:
Golden path:
```yaml
# Standard service template
dependencies:
  - base-service # Helm dependency
values:
  name: my-app
  image: registry/my-app:latest
```
Self-service:
- Push to main → ArgoCD deploys
- PR for namespace changes → Auto-merged after review
Guardrails:
- Kyverno enforces limits, labels, security context
- NetworkPolicy default deny
Observability:
- ServiceMonitor auto-created
- Grafana dashboard generated
Total custom code: ~500 lines of Helm templates and Kyverno policies. Everything else is configuration of existing tools.
Common Mistakes
Starting with the Portal
Don’t build UI until workflows are proven. Test with GitOps and CLI first.
Ignoring Existing Services
Platform must support migration, not just greenfield. Otherwise, you maintain two systems.
Over-Engineering Security
Don’t block every edge case. Start permissive, tighten based on incidents.
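With Kyverno, one way to start permissive is to roll out a policy in Audit mode first. Violations are then recorded in policy reports instead of being rejected, which shows who would be affected before anything is blocked. A sketch using the limits policy from earlier:

```yaml
# Sketch: same require-limits rule, reporting only
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-limits
spec:
  validationFailureAction: Audit # record violations, do not block
  background: true               # also scan resources that already exist
  rules:
    - name: require-limits
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "CPU and memory limits are required"
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    cpu: "?*"
                    memory: "?*"
```

Once the reports are clean, or the remaining violations are understood, switching the action to `Enforce` tightens the guardrail.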
Hiding All Complexity
Some developers want to understand. Provide escape hatches and documentation.
Why This Matters
Developer time is expensive. Every hour spent fighting infrastructure is an hour not building product.
A good platform multiplies productivity:
- Junior developers deploy safely on day one
- Senior developers focus on hard problems
- Operations burden shifts from repetitive work to interesting work
Start small. Solve real problems. Grow based on feedback.
The best platform is invisible — developers just think “deploying is easy here.”
A platform is not a product. It’s the absence of friction.
