Pipeline Sprawl

If you let five teams set up their own CI/CD, you end up with five completely different pipelines. Different linting rules, different testing strategies, different deployment procedures. Scale that to 20 teams and you have a maintenance nightmare that nobody can fully comprehend, let alone own.

Is maximum freedom the right approach? Not quite. When Team A and Team B deploy differently and both cause production incidents, even diagnosing the root cause requires first understanding each team's unique pipeline structure. The cost of that freedom is a total loss of consistency.

Code vs Configuration

|             | Pipeline as Code                 | Pipeline as Configuration                      |
|-------------|----------------------------------|------------------------------------------------|
| What        | Teams write full pipeline logic  | Teams fill in parameters for a shared template |
| Flexibility | Maximum                          | Constrained to template options                |
| Consistency | Low -- every pipeline is unique  | High -- same shape everywhere                  |
| Maintenance | Each team maintains their own    | Platform team maintains the template           |
| Onboarding  | Steep learning curve             | Fill in the blanks                             |

Platform teams should aim for the configuration side. Each team specifies only what actually matters to it -- the language it uses, its test commands, and its deploy targets. Everything else, from build mechanics to security scanning, is handled uniformly by the platform to ensure consistency and quality across the organization.

Reusable Workflows

GitHub Actions reusable workflows are a concrete example of this approach in practice.

```yaml
# .github/workflows/deploy.yml (in each team's repo)
on:
  push:
    branches: [main]

jobs:
  deploy:
    uses: org/platform-workflows/.github/workflows/standard-deploy.yml@v2
    with:
      service-name: my-api
      environment: production
    secrets: inherit
```

The team specifies what to deploy, while the platform handles the actual execution of building, scanning, testing, and deploying. The greatest advantage of this structure is that improving the pipeline once automatically propagates the update to every team using the template. When a security vulnerability is discovered and a new scanning step needs to be added, modifying a single central template applies the change across the entire organization.
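On the platform side, the shared template declares its interface with `workflow_call`. A minimal sketch of what `standard-deploy.yml` might contain (the job steps and placeholder commands here are illustrative, not an actual template):

```yaml
# org/platform-workflows/.github/workflows/standard-deploy.yml
on:
  workflow_call:
    inputs:
      service-name:
        required: true
        type: string
      environment:
        required: true
        type: string

jobs:
  build-scan-deploy:
    runs-on: ubuntu-latest
    environment: ${{ inputs.environment }}
    steps:
      - uses: actions/checkout@v4
      - name: Build image
        run: docker build -t ${{ inputs.service-name }}:${{ github.sha }} .
      - name: Security scan
        # add a new scanner here once; every team picks it up automatically
        run: echo "run scanner of choice"
      - name: Deploy
        run: echo "deploy ${{ inputs.service-name }} to ${{ inputs.environment }}"
```

Because every input is typed and required, a team that omits `service-name` fails fast at workflow parse time rather than mid-deploy.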

GitLab CI achieves the same pattern through include and CI/CD components. The tools differ, but the underlying principle is identical: centralize the logic, distribute only the configuration.
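The GitLab shape of the same idea, sketched with an `include` pointing at a hypothetical shared-templates project:

```yaml
# .gitlab-ci.yml (in each team's repo)
include:
  - project: 'org/platform-workflows'        # hypothetical shared repo
    ref: v2
    file: '/templates/standard-deploy.yml'

variables:
  SERVICE_NAME: my-api
  ENVIRONMENT: production
```

As with reusable workflows, the team's file carries only parameters; the pinned `ref` gives the platform team a versioned rollout path for template changes.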

Environment Promotion

```
commit --> build --> dev (auto) --> staging (auto) --> prod (manual approval)
                                      |
                                      +-- integration tests run here
```

The principle is to build the artifact once and deploy the same image across all environments. Differences between environments should be limited to configuration only -- environment variables and secrets. The moment build outputs diverge between environments, you open the door to the classic "it worked in staging but broke in production" problem. Adhering to this principle ensures confidence in the promotion process -- when an artifact passes staging, you can trust that it will behave identically in production.
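In GitHub Actions terms, build-once promotion can be sketched by building in a single job and passing the resulting image reference to per-environment deploy jobs. The registry URL, `deploy.sh` helper, and job names below are illustrative assumptions, and the staging stage is omitted for brevity:

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image: ${{ steps.build.outputs.image }}
    steps:
      - uses: actions/checkout@v4
      - id: build
        run: |
          # build and push exactly once, tagged by commit SHA
          docker build -t registry.example.com/my-api:${{ github.sha }} .
          docker push registry.example.com/my-api:${{ github.sha }}
          echo "image=registry.example.com/my-api:${{ github.sha }}" >> "$GITHUB_OUTPUT"

  deploy-dev:
    needs: build
    runs-on: ubuntu-latest
    environment: dev
    steps:
      # same image as every other environment; only config differs
      - run: ./deploy.sh "${{ needs.build.outputs.image }}"

  deploy-prod:
    needs: [build, deploy-dev]
    runs-on: ubuntu-latest
    environment: production   # protected environment = manual approval gate
    steps:
      - run: ./deploy.sh "${{ needs.build.outputs.image }}"
```

The manual prod approval lives in the environment's protection rules rather than in the workflow itself, so the promotion path stays identical across services.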

Deployment Strategies

There are three common deployment strategies, each with distinct trade-offs.

A rolling update replaces instances one by one in sequence. It is simple to implement but has the downside of slow rollbacks when problems arise. Blue-green maintains two identical environments and switches traffic all at once. Rollbacks are fast, but maintaining duplicate infrastructure means costs double. Canary sends only a small portion of traffic to the new version first for validation. It is the safest approach, but it only becomes meaningful when observability infrastructure is in place to detect issues emerging from that small traffic slice.

In practice, starting with rolling updates and transitioning to canary deployments once observability is sufficiently mature is the realistic path forward.
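As one concrete form of the canary ramp, Argo Rollouts (my choice of tool here; the text names no specific one) expresses the traffic split declaratively:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-api
spec:
  replicas: 10
  selector:
    matchLabels:
      app: my-api
  strategy:
    canary:
      steps:
        - setWeight: 10           # send 10% of traffic to the new version
        - pause: {duration: 15m}  # window for metrics/alerts to surface issues
        - setWeight: 50
        - pause: {duration: 15m}
        - setWeight: 100
  template:
    metadata:
      labels:
        app: my-api
    spec:
      containers:
        - name: my-api
          image: registry.example.com/my-api:abc123
```

The `pause` steps are exactly where the observability prerequisite bites: without dashboards and alerts watching the 10% slice, the pauses are dead time rather than a safety check.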

Pipeline Speed

Slow pipelines directly erode productivity. There are teams that skip local testing because "CI will catch it" -- except the CI pipeline takes 45 minutes. When developers have to wait 45 minutes for feedback, they eventually learn to ignore pipeline results altogether. Without fast feedback, the very value proposition of CI/CD is undermined.

Maintaining pipeline speed requires dependency caching, test parallelization, and building only what has changed. Linting and unit tests should complete within 5 minutes, and heavier integration tests should be separated into a post-merge stage.
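Caching and parallelization together can be sketched in GitHub Actions as follows. The npm cache path, the four-way shard count, and the Jest-style `--shard` flag are illustrative assumptions:

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]   # run the suite as four parallel jobs
    steps:
      - uses: actions/checkout@v4
      - uses: actions/cache@v4
        with:
          path: ~/.npm
          key: npm-${{ hashFiles('package-lock.json') }}   # cache busts only when deps change
      - run: npm ci
      - run: npm test -- --shard=${{ matrix.shard }}/4
```

With the cache warm and the suite split four ways, wall-clock time approaches the slowest shard plus install, which is usually what gets lint-plus-unit-tests under the 5-minute budget.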

Ultimately, the best pipeline is one that developers never have to think about. You push code, the build and tests run on their own, and results come back quickly. When a pipeline becomes invisible, it is doing its job properly.

In the next post, we look at observability.