Building an Internal Developer Platform: Where to Start

Every platform team eventually asks the same question: should we build an Internal Developer Platform? The honest answer is usually yes. The part that wrecks teams is the how.

I’ve watched platforms that cost a small fortune get shipped and then quietly abandoned because nobody wanted to use them. I’ve also seen a couple of Helm charts and a Kyverno policy change how a whole team ships software. The gap between those two outcomes has almost nothing to do with budget or which fashionable tool you picked. It comes down to whether you started by solving a real problem or by building the platform you imagined developers should want.

So instead of giving you a reference architecture to copy, I want to walk this up from the smallest useful thing. You should be able to stop reading at any point and still have something worth deploying.

What an IDP Actually Is

An Internal Developer Platform is a self-service layer that hides infrastructure plumbing from the people building applications. Developers describe what they need. The platform figures out how to make it real. That’s the whole pitch.

flowchart TD
    subgraph before["Without Platform"]
        D1["Developer"] --> K8s["Kubernetes YAML"]
        D1 --> CI["CI Pipeline"]
        D1 --> Sec["Security Config"]
        D1 --> Mon["Monitoring Setup"]
    end

    subgraph after["With Platform"]
        D2["Developer"] --> IDP["Platform API"]
        IDP --> K8s2["Kubernetes"]
        IDP --> CI2["CI/CD"]
        IDP --> Sec2["Security"]
        IDP --> Mon2["Monitoring"]
    end

The platform sits between what a developer wants and what the cluster requires. Everything in this post is about making that layer thin, honest, and easy to get out of when you need to.

Why So Many Platforms Die

Before the build steps, it’s worth knowing the failure modes, because most of them are predictable and you can design around them.

The classic one is building in isolation. The platform team locks themselves in a room, ships what they think people need, and then gets annoyed when nobody adopts it. The developers weren’t being difficult. They just got handed a tool that solved problems they didn’t have.

Close behind is doing too much too soon. A service catalog, a custom UI, self-service for every conceivable resource, all planned up front. Six months later there’s an impressive demo and nothing in production.

Then there’s picking the wrong altitude. Go too low and the platform is Kubernetes with extra steps, so why bother. Go too high and there’s no escape hatch the day someone hits a case the abstraction never anticipated, which they always will.

And the quiet killer: no migration path. The platform only works for greenfield, your existing services can’t move onto it, and now you run two systems forever. That last one has sunk more platform efforts than any technical mistake.

Start With Problems, Not a Solution

This is the part people skip, and it’s the part that decides everything. Before you write a single template, go find out where the pain actually is.

Where do developers lose time? Waiting on infrastructure requests, copy-pasting the same boilerplate config, debugging deployments that fail for reasons nobody documented, hunting down production access. What do they keep asking for? Your ticketing system knows. Your Slack channels know. And, revolutionary idea, you can ask them.

Then look at where incidents come from. Misconfigured services, missing network policies, resource limits nobody set. Those are the things your platform should automate away first, because automating them pays for itself in fewer 3am pages.

The answers point straight at what to build. If you build anything before you have them, you’re guessing.

The Simplest Useful Platform

Here’s the minimum that’s worth deploying. Three pieces, all built from tools you already run.

Golden paths

Opinionated templates that bake in the right defaults so nobody has to remember them.

# service-template/
├── deployment.yaml      # Standard deployment pattern
├── service.yaml         # Service with sensible defaults
├── networkpolicy.yaml   # Security by default
├── servicemonitor.yaml  # Automatic monitoring
└── values.yaml          # Customization points

A developer gets a working, monitored, network-policied service out of the box and only touches the values that make their service different. The good defaults are the path of least resistance, which is exactly where you want them.

Self-service deployment

Push code, get a deployment. No ticket, no waiting on someone in another timezone.

With GitOps:

# Application definition in Git
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service
  namespace: argocd
spec:
  project: default  # required field; "default" is the project Argo CD creates out of the box
  source:
    repoURL: https://gitlab.internal/my-service
    path: deploy
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Merge to main, ArgoCD picks it up, the service ships. You don’t need a fancy portal for this to feel like self-service. Git is the interface.

Guardrails

The flip side of self-service is that people will deploy things that hurt them. Kyverno policies catch the obvious mistakes before they land:

# Require resource limits
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-limits
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "CPU and memory limits are required"
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    cpu: "?*"
                    memory: "?*"

A pod without limits never makes it into the cluster. Nobody has to remember the rule because the platform won’t let them forget it.

That’s a real platform. Templates, GitOps, guardrails. If you ship only this, developers can deploy safely without learning the deep internals of Kubernetes, and you’ve solved the most common tickets. You could stop here for a year and be fine.

Layer One: The Building Blocks

Once the basics stick, you flesh out the pieces. None of this is exotic; it’s mostly configuration of tools you already have.

Service templates with Helm

A base chart teams extend instead of forking:

# base-service/Chart.yaml
apiVersion: v2
name: base-service
version: 1.0.0
description: Standard service template

# base-service/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Values.name }}
spec:
  replicas: {{ .Values.replicas | default 2 }}
  selector:
    matchLabels:
      app: {{ .Values.name }}
  template:
    metadata:
      labels:
        app: {{ .Values.name }}
      annotations:
        prometheus.io/scrape: "true"
    spec:
      containers:
        - name: {{ .Values.name }}
          image: {{ .Values.image }}
          ports:
            - containerPort: {{ .Values.port | default 8080 }}
          resources:
            requests:
              cpu: {{ .Values.resources.requests.cpu | default "100m" }}
              memory: {{ .Values.resources.requests.memory | default "128Mi" }}
            limits:
              cpu: {{ .Values.resources.limits.cpu | default "500m" }}
              memory: {{ .Values.resources.limits.memory | default "512Mi" }}
          readinessProbe:
            httpGet:
              path: {{ .Values.healthPath | default "/health" }}
              port: {{ .Values.port | default 8080 }}

A team consumes it with a tiny values file:

# my-service/values.yaml
name: my-service
image: registry.internal/my-service:v1.2.3
replicas: 3
port: 8080
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 1
    memory: 1Gi

When you fix a bug in the base chart, every service inherits the fix on its next deploy. That’s the leverage you’re after.
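One thing to watch with a template like this: Helm fails with a nil pointer error when it walks through a key that doesn’t exist, so `.Values.resources.requests.cpu` only falls back to its default if `resources.requests` is at least an empty map. A minimal sketch of a base `values.yaml` that guarantees that, assuming the chart layout above (the values themselves are illustrative):

```yaml
# base-service/values.yaml (illustrative defaults, not a definitive file)
name: ""            # each service must set this
image: ""           # each service must set this
replicas: 2
port: 8080
healthPath: /health
resources:
  requests: {}      # empty maps keep the `| default` lookups in the template safe
  limits: {}
```

Because Helm deep-merges a team’s values over these, a service that sets only `resources.requests` still gets the template’s default limits.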

The namespace as a boundary

Give each team a namespace that arrives with its guardrails already in place:

apiVersion: v1
kind: Namespace
metadata:
  name: team-payments
  labels:
    team: payments
    environment: production
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-payments
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: team-payments
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

The team owns what happens inside the namespace. The platform owns the walls around it. A runaway workload in payments can’t eat the cluster, and a default-deny policy means traffic is closed until someone deliberately opens it.
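One practical consequence of denying egress by default is that pods can no longer resolve DNS names, so DNS is usually the first thing to open. A sketch of that allowance, assuming cluster DNS runs in `kube-system`, the cluster is on Kubernetes 1.22 or later (so every namespace carries the `kubernetes.io/metadata.name` label), and the CNI plugin actually enforces NetworkPolicy:

```yaml
# Allow every pod in the namespace to reach cluster DNS.
# The policy name is hypothetical; the kube-system location is an assumption.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: team-payments
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

NetworkPolicies are additive, so this sits alongside `default-deny` and opens only port 53 toward `kube-system`. Everything else stays closed.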

Self-service through pull requests

Teams change their own config the same way they change code:

infrastructure/
├── teams/
│   ├── payments/
│   │   ├── namespace.yaml
│   │   ├── applications/
│   │   │   ├── api.yaml
│   │   │   └── worker.yaml
│   │   └── secrets/
│   │       └── external-secret.yaml
│   ├── orders/
│   └── ...

A pull request triggers review, the merge triggers an ArgoCD sync, and the change is live. Every infrastructure change has an author, a reviewer, and a diff. No tickets, and a full audit trail you got for free.
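If you don’t want to register every team directory by hand, one way to wire this layout to ArgoCD is an ApplicationSet with a Git directory generator. This is only a sketch: the repository URL, the `default` project, and the `team-` naming are assumptions, and it presumes the ApplicationSet controller is running.

```yaml
# One Application per directory under teams/ (sketch under the assumptions above)
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: team-infrastructure
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://gitlab.internal/infrastructure  # hypothetical repo URL
        revision: main
        directories:
          - path: teams/*
  template:
    metadata:
      name: 'team-{{path.basename}}'
    spec:
      project: default
      source:
        repoURL: https://gitlab.internal/infrastructure  # same hypothetical repo
        targetRevision: main
        path: '{{path}}'
        directory:
          recurse: true   # pick up applications/ and secrets/ subfolders
      destination:
        server: https://kubernetes.default.svc
        namespace: 'team-{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

Adding a new team then becomes a pull request that creates a directory, with no change to ArgoCD itself.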

Layer Two: Growing the Platform

You don’t bolt all of this on at once. Each phase is a place you can comfortably stop and live for a while.

Phase one is what we just built: Helm charts for the common patterns, Kyverno policies for safety, GitOps for deployment, and enough docs to find your way around. People deploy safely without a deep Kubernetes background.

Phase two is observability that nobody has to wire up by hand.

A service deployed through the platform shows up in dashboards with no extra config. When it breaks, the data is already there.
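With the Prometheus Operator, one way to get there is to ship a ServiceMonitor in the base chart next to the Service. A sketch, assuming the operator’s CRDs are installed, that the Service port is named `http`, and that the app serves metrics on `/metrics` (the `metricsPath` value is hypothetical):

```yaml
# base-service/templates/service.yaml (sketch)
apiVersion: v1
kind: Service
metadata:
  name: {{ .Values.name }}
  labels:
    app: {{ .Values.name }}
spec:
  selector:
    app: {{ .Values.name }}
  ports:
    - name: http          # the ServiceMonitor below refers to this port name
      port: 80
      targetPort: {{ .Values.port | default 8080 }}
---
# base-service/templates/servicemonitor.yaml (sketch)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ .Values.name }}
  labels:
    app: {{ .Values.name }}
spec:
  selector:
    matchLabels:
      app: {{ .Values.name }}   # selects the Service above by label
  endpoints:
    - port: http
      path: {{ .Values.metricsPath | default "/metrics" }}  # hypothetical value
      interval: 30s
```

Depending on how your Prometheus resource sets `serviceMonitorSelector`, the ServiceMonitor may also need an extra label before it gets picked up.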

Phase three is when a UI starts to earn its keep, whether that’s Backstage or something you roll yourself:

flowchart TD
    Portal["Developer Portal"] --> Catalog["Service Catalog"]
    Portal --> Templates["Create from Template"]
    Portal --> Docs["Documentation"]
    Portal --> Status["Service Status"]

    Templates --> Git["Git Repository"]
    Git --> ArgoCD["ArgoCD"]
    ArgoCD --> K8s["Kubernetes"]

Notice the portal doesn’t replace any of the earlier layers. It generates Git commits that flow through the same ArgoCD you already trust. It’s a friendlier front door onto machinery that already works.

Phase four is the deeper self-service that most teams eventually need: secret management with Vault, database provisioning, environment cloning, cost visibility. By now the platform handles the long tail of requests that used to land in your queue.
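For the secrets piece, one common pattern is the External Secrets Operator syncing values from Vault into a Kubernetes Secret. A sketch of what a team’s `external-secret.yaml` could look like. The store name, Vault path, and key names are all hypothetical, and it assumes a `ClusterSecretStore` pointing at Vault already exists:

```yaml
# teams/payments/secrets/external-secret.yaml (sketch with hypothetical names)
apiVersion: external-secrets.io/v1beta1   # newer operator releases serve external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: api-db-credentials
  namespace: team-payments
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend          # hypothetical store managed by the platform team
  target:
    name: api-db-credentials     # name of the Kubernetes Secret to create
  data:
    - secretKey: password        # key inside the Kubernetes Secret
      remoteRef:
        key: payments/api/db     # hypothetical path in Vault
        property: password
```

The team commits a reference to the secret, never the secret itself, so this fits the same pull-request workflow as everything else.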

What Not to Build

Just as useful as knowing what to build is knowing what to refuse.

Don’t invent your own YAML DSL. Someone always wants to, and the result is a configuration language only your team understands, with worse tooling than Helm, Kustomize, or cdk8s already give you for free. Resist it.

If developers are still raising tickets to get things done, the automation isn’t finished. A ticket is a sign of a workflow you haven’t built yet.

Resist one-size-fits-all. Golden paths are defaults, not laws. The team that knows exactly what they’re doing should be able to drop down to raw manifests without asking permission.

And keep the platform boring on purpose. A platform that’s constantly shipping new features is usually a platform that hasn’t found the right problems. Stability is a feature here.

Knowing If It’s Working

You’ll want to know whether any of this paid off. A few signals are worth watching.

Time to production is the big one. How long from “new service idea” to “running in prod”? Weeks means the platform isn’t working. Days means you’re on the right track. Hours means you’ve got it.

Ticket volume should trend down as the platform absorbs the routine work:

Before: 50 infrastructure tickets/week
After: 10 infrastructure tickets/week (edge cases only)

Ask developers a blunt question now and then: would you recommend this platform to a colleague? The answer tells you more than any dashboard. And check whether services deployed through the platform actually have fewer incidents than the hand-rolled ones. If the guardrails work, they should.

The Team Behind It

A platform is a product with users, which means it needs people who own it. A rough rule is one platform engineer for every ten to fifteen application developers. Too few and the platform stops evolving. Too many and they start inventing features nobody asked for, which is its own failure mode.

The split of responsibility matters. The platform team maintains the templates and tools, writes the policies, handles platform-level incidents, and helps teams migrate onto the thing. Application teams own their services end to end: they deploy their own code, define their own resource needs, and are first responders when their service breaks.

Start centralized so the platform has a clear owner. As it matures, embed platform engineers part-time into product teams. There’s no better way to find out what’s actually painful than sitting next to the people living with it.

What This Looks Like in Practice

My homelab platform runs on exactly the pieces above.

The golden path is a Helm dependency:

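# my-app/Chart.yaml
apiVersion: v2
name: my-app
version: 0.1.0
dependencies:
  - name: base-service
    version: 1.0.0
    repository: file://../base-service  # assumed location of the base chart
# my-app/values.yaml (values for a dependency nest under its chart name)
base-service:
  name: my-app
  image: registry/my-app:latest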

Self-service is GitOps: push to main and ArgoCD deploys, open a PR for namespace changes and it auto-merges after review. Guardrails are Kyverno enforcing limits, labels, and security context, plus a default-deny NetworkPolicy. Observability comes free, with a ServiceMonitor created automatically and a Grafana dashboard generated alongside it.

Total custom code: somewhere around 500 lines of Helm templates and Kyverno policies. Everything else is configuration of tools that already existed. That ratio is the whole point. The less bespoke code your platform carries, the less of it you have to understand at 3am when something breaks, which matters a lot to me. I want to be able to read my own platform top to bottom.

The Mistakes I See Most

A few patterns show up again and again, so steer clear of them.

Starting with the portal is the most common. The UI is the fun part, so people build it first, before the workflows underneath are even proven. Get GitOps and the CLI working, then put a face on it.

Ignoring existing services is the expensive one. A platform that only handles greenfield leaves you running two worlds in parallel forever. Migration has to be a first-class feature, not an afterthought.

Over-engineering security early will strangle adoption. Block every edge case on day one and people route around the platform entirely. Start permissive and tighten based on what incidents actually teach you.
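With Kyverno, starting permissive can be as simple as rolling out a new policy in audit mode first. A sketch, assuming you want every pod to carry a `team` label (the policy name and the label are illustrative):

```yaml
# Report pods without a team label instead of rejecting them
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label
spec:
  validationFailureAction: Audit   # switch to Enforce once violations are cleaned up
  rules:
    - name: require-team-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "The label `team` is required"
        pattern:
          metadata:
            labels:
              team: "?*"
```

In audit mode, violations show up in Kyverno’s policy reports rather than blocking deploys, which tells you who would break before you turn enforcement on.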

Hiding all the complexity sounds nice until a senior developer needs to understand what’s underneath and can’t. Some people want to see the machinery, and that instinct is healthy. Give them escape hatches and real documentation.

Why Bother

Developer time is the most expensive thing on the team, and every hour spent fighting infrastructure is an hour not spent building the product. A platform that works changes the shape of the whole team: a junior can deploy safely on day one, a senior gets to spend their attention on the genuinely hard problems, and the operations burden shifts from the same tedious requests toward work that’s actually interesting.

The recipe doesn’t change as you scale. Start small. Solve a real problem. Grow when the feedback tells you to. The best platform I’ve used barely registered as a thing at all, because deploying just felt easy and nobody thought about why.
