Every platform team eventually asks: should we build an Internal Developer Platform?

The answer is probably yes. The question is how.

I’ve seen platforms that cost millions and never got adopted. I’ve also seen scrappy internal tools that transformed developer productivity overnight. The difference isn’t budget or technology — it’s approach.

What Is an Internal Developer Platform?

An Internal Developer Platform (IDP) is a self-service layer that abstracts infrastructure complexity from developers. Instead of writing Kubernetes YAML, developers describe what they need. The platform handles how.

flowchart TD
    subgraph before["Without Platform"]
        D1["Developer"] --> K8s["Kubernetes YAML"]
        D1 --> CI["CI Pipeline"]
        D1 --> Sec["Security Config"]
        D1 --> Mon["Monitoring Setup"]
    end

    subgraph after["With Platform"]
        D2["Developer"] --> IDP["Platform API"]
        IDP --> K8s2["Kubernetes"]
        IDP --> CI2["CI/CD"]
        IDP --> Sec2["Security"]
        IDP --> Mon2["Monitoring"]
    end

The platform is the interface between developer intent and infrastructure reality.

Why Platforms Fail

Built in Isolation

Platform team builds what they think developers need. Developers don’t use it. Platform team blames developers for not understanding the vision.

Too Much Too Soon

Starting with a complete platform — service catalog, self-service everything, custom UI. Six months later, still not production-ready.

Wrong Abstraction Level

Platform is either too low-level (just Kubernetes with extra steps) or too high-level (no escape hatches when you need them).

No Migration Path

Existing services can’t migrate. Platform is only for greenfield. Team now maintains two systems forever.

Start With Problems, Not Solutions

Before building anything, understand:

  1. Where do developers lose time?

    • Waiting for infrastructure requests?
    • Writing boilerplate configuration?
    • Debugging deployment failures?
    • Getting production access?
  2. What do they keep asking for?

    • Check your ticketing system
    • Look at Slack questions
    • Talk to people (revolutionary, I know)
  3. Where do incidents originate?

    • Misconfigured services?
    • Missing security policies?
    • Incorrect resource limits?

The answers reveal what your platform should automate first.

The Minimal Viable Platform

Start with three components:

1. Golden Paths

Opinionated templates that encode best practices:

# service-template/
├── deployment.yaml      # Standard deployment pattern
├── service.yaml         # Service with sensible defaults
├── networkpolicy.yaml   # Security by default
├── servicemonitor.yaml  # Automatic monitoring
└── values.yaml          # Customization points

Developers get a working setup by default. They customize only what’s different about their service.

2. Self-Service Deployment

Push code, get deployment. No tickets, no waiting.

With GitOps:

# Application definition in Git
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service
  namespace: argocd
spec:
  project: default  # required field; "default" is Argo CD's built-in project
  source:
    repoURL: https://gitlab.internal/my-service
    path: deploy
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Developer merges to main, ArgoCD deploys. No platform UI needed initially.

3. Guardrails

Kyverno policies that prevent mistakes:

# Require resource limits
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-limits
      match:
        any:  # current match syntax; a bare "resources" block under match is deprecated
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory limits are required"
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    cpu: "?*"
                    memory: "?*"

Developers can’t deploy without limits. The platform enforces standards automatically.

Building Blocks

Service Templates with Helm

Create a base chart that teams extend:

# base-service/Chart.yaml
apiVersion: v2
name: base-service
version: 1.0.0
description: Standard service template

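# base-service/templates/deployment.yaml
# Note: resources.requests and resources.limits must exist as maps (possibly empty)
# in the chart's values.yaml, or the nested lookups below fail at render time.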
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Values.name }}
spec:
  replicas: {{ .Values.replicas | default 2 }}
  selector:
    matchLabels:
      app: {{ .Values.name }}
  template:
    metadata:
      labels:
        app: {{ .Values.name }}
      annotations:
        prometheus.io/scrape: "true"
    spec:
      containers:
        - name: {{ .Values.name }}
          image: {{ .Values.image }}
          ports:
            - containerPort: {{ .Values.port | default 8080 }}
          resources:
            requests:
              cpu: {{ .Values.resources.requests.cpu | default "100m" }}
              memory: {{ .Values.resources.requests.memory | default "128Mi" }}
            limits:
              cpu: {{ .Values.resources.limits.cpu | default "500m" }}
              memory: {{ .Values.resources.limits.memory | default "512Mi" }}
          readinessProbe:
            httpGet:
              path: {{ .Values.healthPath | default "/health" }}
              port: {{ .Values.port | default 8080 }}

Teams use it:

# my-service/values.yaml
name: my-service
image: registry.internal/my-service:v1.2.3
replicas: 3
port: 8080
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 1
    memory: 1Gi
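
The deployment template reads nested keys such as .Values.resources.requests.cpu. Helm resolves these left to right, so rendering fails if resources or resources.requests is missing entirely, before the default call is ever reached. One way to avoid that is for the base chart to ship a values.yaml that declares the maps and leaves the actual numbers to the template defaults. This is a sketch of how that file could look, not the author's file, which the article does not show:

```yaml
# base-service/values.yaml (hypothetical defaults; not shown in the original chart)
name: ""     # each team sets this
image: ""    # each team sets this

# Empty maps keep lookups like .Values.resources.requests.cpu from hitting a
# missing parent; the "default" calls in the template then supply the values.
resources:
  requests: {}
  limits: {}
```

With this in place, a team that sets only name and image still gets a renderable chart, and a team that overrides only requests keeps the default limits because Helm merges maps key by key.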

Namespace as Boundary

Each team gets a namespace with everything pre-configured:

apiVersion: v1
kind: Namespace
metadata:
  name: team-payments
  labels:
    team: payments
    environment: production
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-payments
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: team-payments
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

Teams own their namespace. Platform provides the boundaries.
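
One consequence of the default-deny policy above is that egress to cluster DNS is blocked as well, so pods cannot resolve names until something allows it. A minimal sketch of a companion policy follows. It assumes DNS runs in kube-system with the k8s-app: kube-dns label (the usual CoreDNS setup) and that namespaces carry the automatic kubernetes.io/metadata.name label; check both against your cluster. The policy name is made up for illustration.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress        # hypothetical name
  namespace: team-payments
spec:
  podSelector: {}               # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        # namespaceSelector and podSelector in one entry are ANDed together
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns   # assumption: default CoreDNS/kube-dns label
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Teams then add their own allow rules on top for the traffic their services actually need.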

Self-Service via GitOps

Developers modify their namespace config via pull requests:

infrastructure/
├── teams/
│   ├── payments/
│   │   ├── namespace.yaml
│   │   ├── applications/
│   │   │   ├── api.yaml
│   │   │   └── worker.yaml
│   │   └── secrets/
│   │       └── external-secret.yaml
│   ├── orders/
│   └── ...

Pull request triggers review. Merge triggers ArgoCD sync. No tickets.
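
If each team also gets its own Argo CD project, the pull-request flow above cannot be used to deploy into another team's namespace by accident. The following is a sketch rather than part of the original setup: the repository pattern is a placeholder, and the project name simply mirrors the namespace.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: team-payments
  namespace: argocd
spec:
  description: Applications owned by the payments team
  sourceRepos:
    - https://gitlab.internal/payments/*   # placeholder pattern; adjust to your Git layout
  destinations:
    - server: https://kubernetes.default.svc
      namespace: team-payments
  # clusterResourceWhitelist is left unset on purpose, so Applications in this
  # project should not be able to create cluster-scoped resources.
```

Applications for the team would then set spec.project to team-payments instead of default.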

Progressive Enhancement

Phase 1: Templates and Guardrails

  • Helm charts for common patterns
  • Kyverno policies for safety
  • GitOps for deployment
  • Basic documentation

Outcome: Developers can deploy safely without understanding Kubernetes deeply.

Phase 2: Observability Integration

  • ServiceMonitor created automatically for each service
  • Grafana dashboard generated per service

Outcome: Developers get visibility without configuration.
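
One way to get there, assuming the Prometheus Operator is installed (the servicemonitor.yaml in the golden-path template suggests it), is to render a ServiceMonitor next to every Service. The sketch below shows what the rendered object might look like for my-service. The port name http and the /metrics path are assumptions and have to match what the Service and the application actually expose.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
  namespace: my-service
  labels:
    app: my-service
spec:
  selector:
    matchLabels:
      app: my-service      # must match the labels on the Service object
  endpoints:
    - port: http           # assumption: name of the Service port
      path: /metrics       # assumption: where the application serves metrics
      interval: 30s
```

Depending on how Prometheus is configured, it may only pick up ServiceMonitors that carry a particular label, so the template may need to add one.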

Phase 3: Developer Portal

Add a UI layer (Backstage, custom, or similar):

flowchart TD
    Portal["Developer Portal"] --> Catalog["Service Catalog"]
    Portal --> Templates["Create from Template"]
    Portal --> Docs["Documentation"]
    Portal --> Status["Service Status"]

    Templates --> Git["Git Repository"]
    Git --> ArgoCD["ArgoCD"]
    ArgoCD --> K8s["Kubernetes"]

Outcome: Developers have a single entry point.

Phase 4: Advanced Capabilities

  • Secret management integration (Vault)
  • Database provisioning
  • Environment cloning
  • Cost visibility

Outcome: True self-service for most needs.
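
For the Vault integration, the external-secret.yaml file in the repository layout earlier suggests the External Secrets Operator. A sketch of such a file follows, under the assumption that the platform team has already configured a ClusterSecretStore. The store name, Vault path, and key names are all placeholders, and the apiVersion depends on the operator version installed.

```yaml
apiVersion: external-secrets.io/v1beta1   # newer operator releases also serve v1
kind: ExternalSecret
metadata:
  name: api-credentials             # placeholder
  namespace: team-payments
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend             # placeholder store managed by the platform team
  target:
    name: api-credentials           # Kubernetes Secret the operator creates
  data:
    - secretKey: DATABASE_PASSWORD      # key in the resulting Secret
      remoteRef:
        key: payments/api               # placeholder Vault path
        property: database_password     # placeholder field at that path
```

The secret value never lands in Git; only the reference does, which keeps this compatible with the pull-request workflow.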

What Not to Build

Custom YAML DSL

Don’t invent a configuration language. Use existing tools (Helm, Kustomize, cdk8s).

Ticket-Based Workflows

If developers still raise tickets, you haven’t automated enough.

One-Size-Fits-All

Provide golden paths, but allow deviation. Expert teams should be able to go lower-level.
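
With Kyverno, one way to allow deviation without weakening a policy for everyone is a PolicyException. The sketch below exempts a single hypothetical workload from the require-limits policy shown earlier. It assumes policy exceptions are enabled in the Kyverno installation, and the apiVersion differs between Kyverno releases, so treat both as things to verify.

```yaml
apiVersion: kyverno.io/v2            # older releases use v2beta1 or v2alpha1
kind: PolicyException
metadata:
  name: batch-job-limits-exception   # hypothetical
  namespace: team-payments
spec:
  exceptions:
    - policyName: require-limits
      ruleNames:
        - require-limits
        - autogen-require-limits     # rule Kyverno generates for pod controllers
  match:
    any:
      - resources:
          kinds:
            - Pod
            - Deployment
          namespaces:
            - team-payments
          names:
            - batch-job*             # hypothetical workload name
```

Because the exception is itself a reviewed file in Git, going off the golden path stays visible instead of becoming a silent workaround.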

Feature Factory

Platform should be stable. Continuous feature churn means you’re solving the wrong problems.

Measuring Success

Time to Production

How long from “new service idea” to “running in production”?

  • Weeks → Platform not working
  • Days → Getting there
  • Hours → Success

Ticket Volume

Platform requests should decrease over time:

Before: 50 infrastructure tickets/week
After: 10 infrastructure tickets/week (edge cases only)

Developer NPS

Ask developers: “Would you recommend this platform to a colleague?”

Incident Correlation

Do platform-deployed services have fewer incidents than manually configured ones?

Team Structure

Platform Team Size

Rule of thumb: 1 platform engineer per 10-15 application developers.

  • Too small → Platform doesn’t evolve
  • Too large → Platform team builds features nobody needs

Responsibilities

Platform team:

  • Maintains templates and tools
  • Writes policies
  • Handles platform incidents
  • Provides migration support

Application teams:

  • Own their services
  • Deploy their code
  • Define their resource needs
  • First responders for service incidents

Embedded vs Centralized

Start centralized. As the platform matures, embed platform engineers in product teams part-time so they see real needs firsthand.

Real-World Example

My homelab platform uses:

Golden path:

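# Chart.yaml: pull in the base chart as a Helm dependency
dependencies:
  - name: base-service
    version: 1.0.0
    repository: file://../base-service  # assumed location; point this at your chart repo
# values.yaml: values for a dependency are nested under its name
base-service:
  name: my-app
  image: registry/my-app:latest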

Self-service:

  • Push to main → ArgoCD deploys
  • PR for namespace changes → Auto-merged after review

Guardrails:

  • Kyverno enforces limits, labels, security context
  • NetworkPolicy default deny

Observability:

  • ServiceMonitor auto-created
  • Grafana dashboard generated

Total custom code: ~500 lines of Helm templates and Kyverno policies. Everything else is configuration of existing tools.

Common Mistakes

Starting with the Portal

Don’t build UI until workflows are proven. Test with GitOps and CLI first.

Ignoring Existing Services

Platform must support migration, not just greenfield. Otherwise, you maintain two systems.

Over-Engineering Security

Don’t block every edge case. Start permissive, tighten based on incidents.
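
In Kyverno terms, starting permissive can mean introducing a policy in Audit mode, where violations are reported but nothing is blocked, and switching to Enforce once the reports are clean. A sketch for a required team label follows. The label name comes from the namespace example earlier; the policy itself is illustrative.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label
spec:
  validationFailureAction: Audit   # report violations only; change to Enforce later
  background: true                 # also scan resources that already exist
  rules:
    - name: require-team-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
      validate:
        message: "The label 'team' is required"
        pattern:
          metadata:
            labels:
              team: "?*"
```

Violations appear in Kyverno's policy reports, which gives a list of services to fix before enforcement is turned on.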

Hiding All Complexity

Some developers want to understand. Provide escape hatches and documentation.

Why This Matters

Developer time is expensive. Every hour spent fighting infrastructure is an hour not building product.

A good platform multiplies productivity:

  • Junior developers deploy safely on day one
  • Senior developers focus on hard problems
  • Operations burden shifts from repetitive to interesting

Start small. Solve real problems. Grow based on feedback.

The best platform is invisible — developers just think “deploying is easy here.”


A platform is not a product. It’s the absence of friction.