Engineering Leadership · 5 min read

Why Your CI/CD Pipeline Is Lying to You

A fast green build doesn't mean your software is production-ready. Here's what I learned building deployment systems at AWS scale, and how to build pipelines that actually tell the truth.

A green CI build means your tests passed. It does not mean your software is ready to deploy. This distinction seems obvious, but almost every engineering organization I have worked with conflates the two, and they consistently pay for it in production incidents.

At AWS, we ran deployment pipelines for systems that affected millions of customers. The cost of a bad deploy was measured in minutes of customer impact at global scale. That environment clarified an essential reality: a good pipeline tells you when not to deploy, not just when your test suite finishes cleanly.

What Most Pipelines Actually Test

A typical CI/CD pipeline runs unit tests for fast, isolated coverage, executes integration tests to verify component interactions, triggers a build step, and perhaps finishes with a lint or static analysis check.

This sequence is necessary, but it is not sufficient. It misses the gap between a system that compiles and one that behaves correctly under real production conditions. Unit tests run against mocked dependencies, and integration tests usually touch test databases loaded with synthetic data. Neither environment can predict what happens when your database hits 50 million rows, your cache is completely cold, and 10 concurrent deployments are running across your fleet.

Traditional pipelines also fail to account for the blast radius of a bad deployment. Your pipeline knows if the build passed, but it cannot see a subtle regression that causes a 2% spike in error rates under heavy traffic, which might only materialize an hour after the code goes live.

Finally, standard pipelines rarely verify whether the rollback path actually works. Most architectures test the forward deployment path with immense rigor while ignoring the reverse. This is completely backward. Rollback is the single most critical path in your system, and you need to run it frequently enough to trust it implicitly.

What a Truthful Pipeline Looks Like

Pre-deploy: The gates that matter

Canary analysis is the single highest-leverage investment you can make in deployment safety. You should deploy to a small fraction of your fleet first, monitor the system for 10 to 30 minutes, and closely compare error rates and latency distributions against the rest of the live fleet. The pipeline should only proceed if the canary remains completely healthy. This strategy catches a specific class of bugs that no amount of pre-deploy staging will ever find: the failures that only emerge from live production traffic patterns, real user data, and the unpredictable interplay of your software with live, upstream dependencies.
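
As a rough illustration of that comparison, here is a minimal Python sketch of a canary gate. The `FleetSample` type, the `canary_is_healthy` function, and every threshold below are assumptions made for this example, not values from any particular deployment system; tune them against your own traffic.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import quantiles


@dataclass
class FleetSample:
    """Metrics from one group of hosts over the observation window (hypothetical shape)."""
    requests: int
    errors: int
    latencies_ms: list[float]

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    @property
    def p99_ms(self) -> float:
        # quantiles(n=100) returns 99 cut points; the last approximates the 99th percentile.
        return quantiles(self.latencies_ms, n=100)[-1]


def canary_is_healthy(
    canary: FleetSample,
    baseline: FleetSample,
    max_error_rate_delta: float = 0.005,  # assumed: 0.5 percentage points of extra errors
    max_p99_ratio: float = 1.2,           # assumed: canary P99 may be at most 20% slower
    min_requests: int = 500,              # assumed: minimum canary traffic for a verdict
) -> bool:
    """Return True only if the canary looks no worse than the rest of the fleet."""
    if canary.requests < min_requests:
        return False  # too little traffic to judge, so fail closed
    if canary.error_rate - baseline.error_rate > max_error_rate_delta:
        return False
    if canary.p99_ms > baseline.p99_ms * max_p99_ratio:
        return False
    return True
```

Failing closed on thin traffic is a deliberate choice in this sketch: a canary that saw almost no requests has not proven anything, so the pipeline should wait rather than proceed.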

Alongside canaries, you need automated rollback smoke tests. Before every production deployment, you should run a rapid smoke test of the rollback procedure in a staging environment. Answering whether rolling back actually works right now is a question you want resolved long before you are forced to do it under pressure.

Canaries and rollback tests work best alongside strict control of deployment blast radius. Deploy serially across regions rather than pushing changes everywhere at once. Your pipeline needs to explicitly identify which region acts as the initial canary because of its lower traffic volume, which regions carry full production traffic, and exactly what approval gates govern the movement between them.
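
One way to picture a serial, gated rollout is the sketch below. The `rollout` function and its `deploy`, `gate`, and `rollback` callbacks are hypothetical placeholders for whatever your deployment tooling provides; the region order is assumed to start with the lowest-traffic region.

```python
from __future__ import annotations

from typing import Callable, Sequence


def rollout(
    regions: Sequence[str],
    deploy: Callable[[str], None],
    gate: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> list[str]:
    """Deploy one region at a time, in order.

    After each region, the gate (automated check or human approval) decides
    whether to continue. On the first failed gate, every region deployed so
    far is rolled back in reverse order and an empty list is returned.
    """
    deployed: list[str] = []
    for region in regions:
        deploy(region)
        deployed.append(region)
        if not gate(region):
            for done in reversed(deployed):
                rollback(done)
            return []
    return deployed
```

Because the gate runs before the next region is touched, a failure in the first region never reaches the regions that carry full production traffic.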

Post-deploy: The gates almost everyone skips

Your infrastructure must handle deployment freeze detection. A deployment that looks perfectly stable for five minutes but steadily degrades over two hours (a common signature of memory leaks or connection pool exhaustion) should trigger an automatic rollback. Your pipeline needs to keep watching the environment long after the deployment script finishes executing.
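
A simple way to catch that slow-burn pattern is to compare the most recent readings of a metric against the readings taken right after the deploy. The sketch below assumes evenly spaced samples of a metric where higher is worse (memory in use, open connections); the function name, window size, and growth threshold are illustrative assumptions.

```python
from __future__ import annotations

from statistics import mean
from typing import Sequence


def is_degrading(
    samples: Sequence[float],
    window: int = 5,           # assumed: number of readings averaged at each end
    max_growth: float = 1.25,  # assumed: tolerate up to 25% growth since deploy
) -> bool:
    """Return True if the latest readings have grown well beyond the earliest ones."""
    if len(samples) < 2 * window:
        return False  # not enough post-deploy data yet, keep watching
    early = mean(samples[:window])
    recent = mean(samples[-window:])
    return recent > early * max_growth
```

A post-deploy watcher could call this on a schedule for the length of the monitoring window and trigger the rollback path the first time it returns True.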

This monitoring must extend to business metric anomaly detection. Beyond purely technical metrics like error rates and system latency, your pipeline needs to track core business indicators such as checkout conversion rates, API call success rates, and user session duration. A deployment that passes every technical gate but causes a sudden 5% drop in conversions is still a broken deployment.
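
A business-metric gate can be as small as a relative-drop check against a pre-deploy baseline. In this sketch the metric names and the 5% tolerance are assumptions for illustration, and every metric is assumed to be one where higher is better.

```python
from __future__ import annotations

from typing import Mapping


def failing_business_metrics(
    baseline: Mapping[str, float],
    current: Mapping[str, float],
    max_relative_drop: float = 0.05,  # assumed: flag drops larger than 5%
) -> list[str]:
    """Return the names of metrics that dropped too far relative to baseline."""
    failing: list[str] = []
    for name, before in baseline.items():
        after = current.get(name)
        if after is None:
            failing.append(name)  # missing data is treated as a failure
            continue
        if before <= 0:
            continue  # relative drop is undefined, skip this metric
        if (before - after) / before > max_relative_drop:
            failing.append(name)
    return failing
```

For example, a checkout conversion rate that falls from 0.040 to 0.037 is a 7.5% relative drop, so it would be reported even if error rates and latency look normal.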

The Organizational Problem

The real reason pipelines lie is rarely technical; it is an organizational issue driven by the fact that pipeline quality itself is almost never measured. Teams obsess over deployment frequency, mean time to recover, and change failure rates, but they routinely ignore the false positive rate of the pipeline, which tracks deployments that were approved but ultimately caused incidents.

Teams also fail to measure the false-negative rate, where good changes are blocked unnecessarily, along with the time it takes to detect a bad deployment and the overall rollback success rate. If you do not measure pipeline quality, you will never invest the engineering resources to improve it. When a pipeline has a high false-negative rate and frequently blocks good changes, engineers simply learn to ignore the gates, which leaves you in a worse position than having no gates at all.
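
These pipeline quality numbers are straightforward to compute once each deployment attempt is recorded with its outcome. The sketch below is one possible shape: the `DeployRecord` fields are hypothetical, and the denominators (approved deployments for false positives, blocked changes for false negatives) are an assumption about how to define the rates.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class DeployRecord:
    approved: bool  # did the pipeline let this change through?
    harmful: bool   # caused an incident, or was judged harmful on later review
    minutes_to_detect: Optional[float] = None  # set only for detected bad deploys
    rollback_succeeded: Optional[bool] = None  # None when no rollback was attempted


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    return numerator / denominator if denominator else None


def pipeline_quality(records: Sequence[DeployRecord]) -> dict[str, Optional[float]]:
    approved = [r for r in records if r.approved]
    blocked = [r for r in records if not r.approved]
    rollbacks = [r for r in records if r.rollback_succeeded is not None]
    detect = [r.minutes_to_detect for r in records if r.minutes_to_detect is not None]
    return {
        # Approved deployments that went on to cause incidents.
        "false_positive_rate": _ratio(sum(r.harmful for r in approved), len(approved)),
        # Blocked changes that were actually fine.
        "false_negative_rate": _ratio(sum(not r.harmful for r in blocked), len(blocked)),
        "mean_minutes_to_detect": mean(detect) if detect else None,
        "rollback_success_rate": _ratio(
            sum(bool(r.rollback_succeeded) for r in rollbacks), len(rollbacks)
        ),
    }
```

Classifying a blocked change as harmless requires a human review after the fact, so the false-negative figure is only as good as that review process.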

Practical Steps

To fix this, you can break your strategy down into immediate, medium-term, and long-term milestones.

For this sprint, focus on manually adding canary analysis to your next critical deployment, even if that just means personally comparing error rates before and after the push. You should also run a dedicated rollback drill by deploying to staging, forcing a rollback, and confirming that the environment recovers cleanly.

Over the course of this quarter, you can transition to automated canary analysis with defined success criteria, ensuring error rates stay below a strict threshold and P99 latency remains within bounds. You should also introduce at least one post-deploy monitoring window paired with explicit, automated rollback triggers, and start actively measuring your pipeline false-positive and false-negative rates.

Within the next six months, the goal should be full progressive delivery across your entire fleet using automated gates. This includes deep business metric integration into your deployment health checks and building comprehensive deployment playbooks that explicitly document what specific indicators to watch for across every core service.


The ultimate objective is a pipeline that your engineers trust enough to deploy on a Friday afternoon. That is not an act of bravado; it is the ultimate benchmark for a pipeline that actually tells the truth. If your team is naturally nervous about Friday deployments, your pipeline is lying to you.
