Self-Serve without Tickets: A Developer-First CI Control Plane

The problem: tickets, handoffs, and toil

When a team needs CI capacity (or a change to it), the default in many orgs is a ticket to another team. That creates toil: manual, repetitive work that scales linearly with service growth, which is exactly the kind of work SRE guidance says to eliminate or cap. SRE book · SRE workbook

Meanwhile, each team invents its own way to "get a runner," ballooning cognitive load. Platform engineering's advice: treat the platform as a product and reduce cognitive load with paved/golden paths. Team Topologies

The real case: ticket ping-pong that blocked delivery

Context. A Series-B product team (Linux/Windows/macOS builds) needed new CI capacity every sprint. The flow looked like this:

Developers → DevOps: "Need 150 runners for load tests."
DevOps → IT: VMs, network, images, policies.
Back-and-forth clarifications; DevOps installs toolchains; agents register.
Validation fails on a subset → more tickets.

Baseline pain (3-month avg).

3–7 days from request to first green build for a new team.
150 VMs with full stacks: ~3 days of human effort end-to-end.
Runner tickets: 90–120/month; ~42% bounced DevOps↔IT at least once.

Decision. Replace tickets/DIY scripts with a developer-first control plane: quotas/RBAC, Pool Definitions (OS, size, role packs, CI connector), immutable updates (spec change → safe rotation with audit log), and optional approvals. This mirrors platform-as-a-product and golden path guidance. Team Topologies · Red Hat

Design goals (and non-goals)

Self-serve with guardrails: developers provision CI without tickets; RBAC/quotas keep it sane. Team Topologies
One way that works: a golden path for runner pools (Linux/Windows/macOS), so teams don't reinvent scripts. VMware
Toil reduction: routine runner work becomes platform features. Google SRE
Anywhere: the same developer experience (DX) in cloud, on-prem, and air-gapped environments.

What we avoid

Ticket-based provisioning as a primary path.
In-place edits to runners (updates flow via definitions — see Post #1).

Architecture at a glance: product, not playbooks

Create project (scoped quotas, RBAC).
Define pool (OS + size + role packs/toolchains + CI connector).
Attach CI (Jenkins agent or GitHub runner).
Change spec → rotate safely (immutable replace; audit log).

This builds on how GitHub self-hosted runners, Runner Scale Sets, and Actions Runner Controller (ARC) already work: they are excellent primitives, and the control plane productizes them into a uniform flow.
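The "change spec → rotate safely" step can be sketched roughly as follows. This is a minimal illustration, not the control plane's actual implementation: the `Node`, `Pool`, and `rotate` names, the batch-based strategy, and the log line format are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Node:
    """An immutable runner node; updating it means replacing it (hypothetical model)."""
    name: str
    spec_version: int


@dataclass
class Pool:
    name: str
    nodes: list[Node]
    audit_log: list[str] = field(default_factory=list)


def rotate(pool: Pool, new_version: int, batch_size: int,
           is_healthy: Callable[[Node], bool]) -> bool:
    """Replace stale nodes batch by batch; no node is ever edited in place."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    stale = [n for n in pool.nodes if n.spec_version != new_version]
    for start in range(0, len(stale), batch_size):
        batch = stale[start:start + batch_size]
        # Build replacements from the new spec instead of mutating old nodes.
        replacements = [Node(n.name, new_version) for n in batch]
        if not all(is_healthy(r) for r in replacements):
            # Old nodes in this batch stay in service; a real system would
            # also clean up the failed replacements here.
            pool.audit_log.append(
                f"rotate.abort pool={pool.name} version={new_version}")
            return False
        retired = {n.name for n in batch}
        pool.nodes = [n for n in pool.nodes if n.name not in retired] + replacements
        pool.audit_log.append(
            f"rotate.batch pool={pool.name} version={new_version} nodes={len(batch)}")
    pool.audit_log.append(f"rotate.done pool={pool.name} version={new_version}")
    return True
```

Rotating in batches keeps most of the pool serving jobs while replacements come up, and an unhealthy batch stops the rollout with the previous nodes still in place.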

Walkthrough (selective, not a manual)

Scenario: A backend team needs 10 Linux runners with Docker + build tools, connected to GitHub.

project create --name backend-app --quota pools=2,nodes=20
pool define  --name backend-ci --os=linux --size=10 \
  --roles=docker-<pinned>,build-essential --ci=github-runner
pool apply   backend-ci

apply.spec   pool=backend-ci os=linux size=10 roles=[docker...,build-essential]
create.node  node=backend-01 ... backend-10
register.ci  node=backend-01..10 ci=github agent_id=GHA-...
pool.ready   pool=backend-ci nodes=10
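One way to picture the pool definition behind these commands is as an immutable record whose fingerprint changes whenever the spec changes. The sketch below is an assumption for illustration only: `PoolDefinition`, `fingerprint`, and `needs_rotation` are hypothetical names, and treating every field (including size) as rotation-relevant is a simplification.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class PoolDefinition:
    """Hypothetical, immutable description of a runner pool."""
    name: str
    os: str
    size: int
    roles: tuple[str, ...]
    ci: str

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys) so equal specs always hash identically.
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def needs_rotation(current: PoolDefinition, desired: PoolDefinition) -> bool:
    """A differing fingerprint means the pool must be replaced, not patched."""
    return current.fingerprint() != desired.fingerprint()


# Values taken from the walkthrough; "docker-<pinned>" is left elided as in the text.
current = PoolDefinition(
    name="backend-ci",
    os="linux",
    size=10,
    roles=("docker-<pinned>", "build-essential"),
    ci="github-runner",
)

# "example-role" is a made-up role pack used only to show a spec change.
desired = replace(current, roles=current.roles + ("example-role",))
print(needs_rotation(current, desired))
```

Because the dataclass is frozen, the only way to express a change is to build a new definition, which matches the "no in-place edits" rule.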

[Figure: self-serve CI control plane, screenshots 1 and 2]

"Isn't this just ARC/scale sets?"

Others vs. Control Plane (quick read)

ARC / Runner Scale Sets: autoscale self-hosted runners on K8s; you still own images, versions, and updates. Great primitives; not a golden path by themselves. ARC docs · Scale sets
DIY scripts / tickets: flexible, but high toil and inconsistent results. SRE
Control plane: opinionated flows + immutable pools + RBAC/quotas + audited updates → consistent, low-toil CI across OSes.

What changed (numbers that matter)

KPI	Before	After
Time-to-first-pipeline (new team)	3–7 days	2–6 hours
Provisioning 150 VMs w/ stacks	~3 days (human effort)	Spec apply ~2 min; healthy ~12–20 min
Runner tickets / month	90–120	20–30 (≈ −78%)
Ticket reassign rate (DevOps ↔ IT)	~42%	<10%

This is the expected outcome: eliminating toil and providing a golden path reduce handoffs and cognitive load. SRE · Team Topologies
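The relative change in runner tickets can be checked directly from the table's ranges. The short calculation below is a sketch: it assumes the before and after ranges are independent, so it reports the change at the range midpoints together with the smallest and largest reductions the ranges allow. The `percent_change` helper is introduced here for illustration.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent (negative = reduction)."""
    return (after - before) / before * 100.0


# Runner tickets per month, taken from the KPI table.
before_low, before_high = 90, 120
after_low, after_high = 20, 30

midpoint = percent_change((before_low + before_high) / 2,
                          (after_low + after_high) / 2)
smallest_reduction = percent_change(before_low, after_high)   # 90 -> 30
largest_reduction = percent_change(before_high, after_low)    # 120 -> 20

print(f"midpoint: {midpoint:.1f}%")                      # midpoint: -76.2%
print(f"smallest reduction: {smallest_reduction:.1f}%")  # smallest reduction: -66.7%
print(f"largest reduction: {largest_reduction:.1f}%")    # largest reduction: -83.3%
```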

Guardrails: self-serve ≠ "anything goes"

RBAC + quotas: org/project roles; caps on pools/nodes/burst size. Team Topologies
Approval hooks (optional): e.g., pools over a threshold require an approver.
Cataloged role packs: pinned toolchains; no random images. Red Hat
Audit trail: spec diffs + rotation logs for every change.
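The guardrails above can be read as a single admission check that runs before any pool is created. The following is a minimal sketch under stated assumptions: the `Quota`, `Usage`, `PoolRequest`, and `evaluate` names, the order of the checks, and the decision strings are hypothetical and not taken from the product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quota:
    max_pools: int
    max_nodes: int


@dataclass(frozen=True)
class Usage:
    pools: int
    nodes: int


@dataclass(frozen=True)
class PoolRequest:
    role: str                    # requester's project role
    size: int                    # number of nodes requested
    role_packs: tuple[str, ...]  # toolchain packs requested


def evaluate(request: PoolRequest, quota: Quota, usage: Usage,
             catalog: frozenset[str], provisioner_roles: frozenset[str],
             approval_threshold: int) -> tuple[str, str]:
    """Return (decision, reason); decision is 'allow', 'deny', or 'needs-approval'."""
    # RBAC: only roles allowed to provision may create pools.
    if request.role not in provisioner_roles:
        return ("deny", "role may not provision pools")
    # Catalog: only cataloged (pinned) role packs are accepted.
    unknown = [p for p in request.role_packs if p not in catalog]
    if unknown:
        return ("deny", f"role packs not in catalog: {', '.join(unknown)}")
    # Quotas: caps on pools and nodes per project.
    if usage.pools + 1 > quota.max_pools:
        return ("deny", "pool quota exceeded")
    if usage.nodes + request.size > quota.max_nodes:
        return ("deny", "node quota exceeded")
    # Optional approval hook for large pools.
    if request.size > approval_threshold:
        return ("needs-approval", f"size {request.size} exceeds {approval_threshold}")
    return ("allow", "within quota")
```

For example, with the walkthrough's quota of 2 pools and 20 nodes, a 10-node request from an empty project would pass, while a 25-node request would be denied before any VM is created.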

Cloud, on-prem, air-gapped: same experience

Cloud / connected on-prem: SSO, webhooks, registries via proxy.
Air-gapped: offline bundles and internal mirrors keep the workflow identical with zero egress. Air-gapped CD

References

Google SRE — Eliminating Toil: sre.google/sre-book/eliminating-toil · workbook
Team Topologies — Platform as a Product & cognitive load: teamtopologies.com/key-concepts
Golden paths in practice — VMware · Red Hat
GitHub Actions — Runner Scale Sets & ARC: scale sets · ARC
Air-gapped pipelines — hoop.dev