Self-Serve without Tickets: A Developer-First CI Control Plane
How a Series-B product team cut a 3-day ticket queue to minutes
TL;DR
- Tickets slow teams. Self-service gives developers environments and pipelines without human gatekeepers—with guardrails, not chaos. Team Topologies
- Goal: eliminate toil by turning routine runner work into a productized control plane. Google SRE
- Principle: use a golden path, meaning one opinionated, supported way to create, update, and use immutable pools across operating systems. VMware · Red Hat
- Outcome: fewer handoffs, hours to first pipeline (not days), consistent results across Linux/Windows/macOS, in cloud, on-prem, or air-gapped setups.
Treating the platform as a product means building paved roads (golden paths) that reduce cognitive load and variance. Team Topologies
The problem: tickets, handoffs, and toil
When a team needs CI capacity (or a change to it), the default in many orgs is a ticket to another team. That creates toil: manual, repetitive work that scales linearly with service growth, which is exactly the kind of work SRE guidance says to eliminate or cap. SRE book · SRE workbook
Meanwhile, each team invents its own way to "get a runner," ballooning cognitive load. Platform engineering's advice: treat the platform as a product and reduce cognitive load with paved/golden paths. Team Topologies
The real case: ticket ping-pong that blocked delivery
Context. A Series-B product team (Linux/Windows/macOS builds) needed new CI capacity every sprint. The flow looked like this:
- Developers → DevOps: "Need 150 runners for load tests."
- DevOps → IT: VMs, network, images, policies.
- Back-and-forth clarifications; DevOps installs toolchains; agents register.
- Validation fails on a subset → more tickets.
Baseline pain (3-month avg).
- 3–7 days from request to first green build for a new team.
- 150 VMs with full stacks: ~3 days of human effort end-to-end.
- Runner tickets: 90–120/month; ~42% bounced DevOps↔IT at least once.
Decision. Replace tickets/DIY scripts with a developer-first control plane: quotas/RBAC, Pool Definitions (OS, size, role packs, CI connector), immutable updates (spec change → safe rotation with audit log), and optional approvals. This mirrors platform-as-a-product and golden path guidance. Team Topologies · Red Hat
Design goals (and non-goals)
- Self-serve with guardrails: developers provision CI without tickets; RBAC/quotas keep it sane. Team Topologies
- One way that works: a golden path for runner pools (Linux/Windows/macOS), so teams don't reinvent scripts. VMware
- Toil reduction: routine runner work becomes platform features. Google SRE
- Anywhere: same DX in cloud, on-prem, and air-gapped.
What we avoid
- Ticket-based provisioning as a primary path.
- In-place edits to runners (updates flow via pool definitions; see Post #1).
Architecture at a glance: product, not playbooks
- Create project (scoped quotas, RBAC).
- Define pool (OS + size + role packs/toolchains + CI connector).
- Attach CI (Jenkins agent or GitHub runner).
- Change spec → rotate safely (immutable replace; audit log).
This flow builds on how GitHub self-hosted runners, Runner Scale Sets, and Actions Runner Controller (ARC) already work. They are excellent primitives, and the control plane productizes them into one uniform flow.
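The last step of the flow (change spec, rotate safely) can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: `PoolSpec`, `spec_diff`, `rotate`, the `generation` counter, the node naming scheme, the batch size, and the `spec.change` / `retire.node` event names are all assumptions made for this example. What it shows is that a spec change never edits a runner in place; replacement nodes are created first, old nodes are retired afterwards, and every step lands in an audit log.

```python
from dataclasses import asdict, dataclass, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)  # frozen: a spec is replaced, never mutated
class PoolSpec:
    """Hypothetical pool definition; field names are illustrative."""
    name: str
    os: str
    size: int
    roles: Tuple[str, ...]
    ci: str


def spec_diff(old: PoolSpec, new: PoolSpec) -> Dict[str, Tuple[object, object]]:
    """Return {field: (old_value, new_value)} for every field that changed."""
    before, after = asdict(old), asdict(new)
    return {k: (before[k], after[k]) for k in before if before[k] != after[k]}


def rotate(old_nodes: List[str], old: PoolSpec, new: PoolSpec,
           generation: int, batch: int = 2) -> Tuple[List[str], List[dict]]:
    """Replace every node with a fresh one built from the new spec.

    Returns the new node names and the audit entries, in order.
    """
    diff = spec_diff(old, new)
    if not diff:
        return list(old_nodes), []  # no spec change, nothing to rotate

    audit: List[dict] = [{"event": "spec.change", "pool": new.name, "diff": diff}]
    # Assumed naming scheme: <pool>-g<generation>-<index>
    new_nodes = [f"{new.name}-g{generation}-{i:02d}" for i in range(1, new.size + 1)]

    remaining = list(old_nodes)
    for start in range(0, len(new_nodes), batch):
        # Create the replacement batch first so capacity never dips...
        audit.append({"event": "create.node", "nodes": new_nodes[start:start + batch]})
        # ...then retire the same number of old nodes.
        retired, remaining = remaining[:batch], remaining[batch:]
        if retired:
            audit.append({"event": "retire.node", "nodes": retired})
    if remaining:  # the pool shrank: retire whatever is left over
        audit.append({"event": "retire.node", "nodes": remaining})
    return new_nodes, audit


if __name__ == "__main__":
    v1 = PoolSpec("backend-ci", "linux", 4, ("docker", "build-essential"), "github-runner")
    v2 = replace(v1, roles=v1.roles + ("jdk",))  # "jdk" is a made-up role pack
    nodes, log = rotate(["n1", "n2", "n3", "n4"], v1, v2, generation=2)
    for entry in log:
        print(entry)
```

Creating each batch before retiring the matching old nodes keeps capacity from dropping mid-rotation. A real control plane would presumably also wait for health checks and CI registration before retiring anything; that step is left out here.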
Walkthrough (selective, not a manual)
Scenario: A backend team needs 10 Linux runners with Docker + build tools, connected to GitHub.
project create --name backend-app --quota pools=2,nodes=20
pool define --name backend-ci --os=linux --size=10 \
--roles=docker-<pinned>,build-essential --ci=github-runner
pool apply backend-ci
Resulting event log (abridged):
apply.spec pool=backend-ci os=linux size=10 roles=[docker...,build-essential]
create.node node=backend-01 ... backend-10
register.ci node=backend-01..10 ci=github agent_id=GHA-...
pool.ready pool=backend-ci nodes=10
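Event lines like the ones above are easy to consume from scripts. The sketch below assumes the log is plain text with an event name followed by whitespace-separated `key=value` pairs, as in the excerpt; the `Event` type and the `parse_event` and `pool_is_ready` helpers are hypothetical names for this example, not part of any shipped tool.

```python
from typing import Dict, Iterable, List, NamedTuple


class Event(NamedTuple):
    name: str                # e.g. "pool.ready"
    fields: Dict[str, str]   # key=value pairs, values kept as strings
    extra: List[str]         # tokens that are not key=value (e.g. "...")


def parse_event(line: str) -> Event:
    """Split 'event.name key=value ...' into a structured record."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty event line")
    fields: Dict[str, str] = {}
    extra: List[str] = []
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if sep and key:
            fields[key] = value
        else:
            extra.append(token)  # keep elided or free-form tokens as-is
    return Event(tokens[0], fields, extra)


def pool_is_ready(lines: Iterable[str], pool: str, expected_nodes: int) -> bool:
    """True once a pool.ready event reports the expected node count."""
    for line in lines:
        event = parse_event(line)
        if event.name == "pool.ready" and event.fields.get("pool") == pool:
            return int(event.fields.get("nodes", "0")) == expected_nodes
    return False


if __name__ == "__main__":
    log = [
        "apply.spec pool=backend-ci os=linux size=10",
        "pool.ready pool=backend-ci nodes=10",
    ]
    print(pool_is_ready(log, "backend-ci", 10))
```

A check like `pool_is_ready` is the kind of gate a pipeline could poll before scheduling jobs onto a new pool.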
"Isn't this just ARC/scale sets?"
Others vs. Control Plane (quick read)
- ARC / Runner Scale Sets: autoscale self-hosted runners on K8s; you still own images, versions, and updates. Great primitives; not a golden path by themselves. ARC docs · Scale sets
- DIY scripts / tickets: flexible, but high toil and inconsistent results. SRE
- Control plane: opinionated flows + immutable pools + RBAC/quotas + audited updates → consistent, low-toil CI across OSes.
What changed (numbers that matter)
| KPI | Before | After |
|---|---|---|
| Time-to-first-pipeline (new team) | 3–7 days | 2–6 hours |
| Provisioning 150 VMs w/ stacks | ~3 days (human effort) | Spec apply ~2 min; all nodes healthy in ~12–20 min |
| Runner tickets / month | 90–120 | 20–30 (≈ −78%) |
| Ticket reassign rate (DevOps ↔ IT) | ~42% | <10% |
These results are what the guidance predicts: eliminating toil and providing a golden path reduce handoffs and cognitive load. SRE · Team Topologies
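For readers who want to check the ticket reduction themselves, here is the arithmetic as a short Python snippet. It uses only the ranges from the table; how to pair the range endpoints (low with low, high with high, or midpoints) is an assumption of this sketch. The table's ≈ −78% matches the low ends of both ranges (90 → 20), and the other pairings land between −75% and −77%.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent (negative = reduction)."""
    return (after - before) / before * 100.0


tickets_before = (90, 120)  # runner tickets per month, from the table
tickets_after = (20, 30)

low = pct_change(tickets_before[0], tickets_after[0])    # 90 -> 20
high = pct_change(tickets_before[1], tickets_after[1])   # 120 -> 30
mid = pct_change(sum(tickets_before) / 2, sum(tickets_after) / 2)  # 105 -> 25

print(f"low ends: {low:.1f}%  midpoints: {mid:.1f}%  high ends: {high:.1f}%")
# prints: low ends: -77.8%  midpoints: -76.2%  high ends: -75.0%
```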
Guardrails: self-serve ≠ "anything goes"
- RBAC + quotas: org/project roles; caps on pools/nodes/burst size. Team Topologies
- Approval hooks (optional): e.g., pools over a threshold require an approver.
- Cataloged role packs: pinned toolchains; no random images. Red Hat
- Audit trail: spec diffs + rotation logs for every change.
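The guardrails above boil down to a policy check that runs before any pool is created. The following Python sketch is one way such a check could look, under stated assumptions: `Quota`, `PoolRequest`, `evaluate`, the three decision strings, and the role names in the usage example are invented for illustration, and RBAC (who may submit the request at all) is left out for brevity.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Quota:
    max_pools: int
    max_nodes: int


@dataclass(frozen=True)
class PoolRequest:
    project: str
    size: int
    roles: FrozenSet[str]


def evaluate(req: PoolRequest, quota: Quota, pools_in_use: int, nodes_in_use: int,
             catalog: FrozenSet[str], approval_threshold: int) -> Tuple[str, List[str]]:
    """Return ("allow" | "needs_approval" | "deny", reasons)."""
    reasons: List[str] = []
    if pools_in_use + 1 > quota.max_pools:
        reasons.append("pool quota exceeded")
    if nodes_in_use + req.size > quota.max_nodes:
        reasons.append("node quota exceeded")
    unknown = sorted(req.roles - catalog)  # only cataloged role packs are allowed
    if unknown:
        reasons.append("roles not in catalog: " + ", ".join(unknown))
    if reasons:
        return "deny", reasons
    if req.size > approval_threshold:  # optional approval hook
        return "needs_approval", [f"size {req.size} exceeds threshold {approval_threshold}"]
    return "allow", []


if __name__ == "__main__":
    quota = Quota(max_pools=2, max_nodes=20)  # mirrors pools=2,nodes=20 in the walkthrough
    catalog = frozenset({"docker", "build-essential"})
    request = PoolRequest("backend-app", 10, frozenset({"docker", "build-essential"}))
    print(evaluate(request, quota, 0, 0, catalog, approval_threshold=50))
```

Hard limits (quotas, catalog) are evaluated before the approval threshold, so an approver is never asked to sign off on a request that would be rejected anyway.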
Cloud, on-prem, air-gapped: same experience
- Cloud / connected on-prem: SSO, webhooks, registries via proxy.
- Air-gapped: offline bundles and internal mirrors keep the workflow identical with zero egress. Air-gapped CD
References
- Google SRE — Eliminating Toil: sre.google/sre-book/eliminating-toil · workbook
- Team Topologies — Platform as a Product & cognitive load: teamtopologies.com/key-concepts
- Golden paths in practice — VMware · Red Hat
- GitHub Actions — Runner Scale Sets & ARC: scale sets · ARC
- Air-gapped pipelines — hoop.dev