ChatGPT Won't Replace Your Pipeline
The first time I asked ChatGPT to write a deployment runbook, it did. That was the problem. The output was close enough to be dangerous: kubectl steps, rollback sequence, health check endpoints. Structured, clear, apparently professional. But it had no idea whether any of it belonged to us. Whether our tooling matched what it described. Whether the service was subject to SOX controls or just basic SLO monitoring. It wrote a competent generic runbook for a deployment process that didn’t exist at our org. We already had something for that. It was called Google. ...
The Agent Is 20% of the Work. The Platform Is the Other 80%.
A payroll team shipped a production AI agent last year. Real workload, not a demo: processing 3,000+ emails a day, classifying them, extracting data and entering payroll. Six distinct steps, end to end. Their test accuracy: 94%. Good enough to ship. Their production accuracy: 70%. That’s the talk I keep thinking about from AI Dev 26. The drop itself isn’t news. What they did about it is. The accuracy gap has a cause: the 94% looked clean because the test set was curated. It covered the cases the team had thought of. Production didn’t care about that. It sent typos. Impossible numbers. Screenshots. Hand-drawn notes. Vague references with no context. Conflicting instructions from two people in the same email thread. ...
What DevOps Taught Me About Running a Function
What DevOps Taught Me About Running a Function Or: the three metrics I use to know if a platform team is healthy. The first time I inherited a platform team I asked the obvious question. How is the platform doing? Uptime green, deploys up, tickets closing faster than they were opening. Two months later I knew none of those numbers had told me anything about whether the team was actually doing its job. ...
Why I Stopped Writing (And What Happened Since)
One of the last posts on this blog went up in August 2022. Time to restore service. Twenty-eight articles in nine months, and then nothing for three and a half years. I wasn’t burned out. I didn’t lose interest. The blog went quiet because I took a new job two weeks later, and the work ate the writing. That’s the honest version. The strategic version, the one that matters now, is that the work itself was the foundation I needed for what I’m doing today. I just couldn’t see that while I was inside it. ...
AI Doesn't Fix Your Development Problems
AI code generation is the most powerful version of the ‘new tool, same rework’ pattern I’ve ever seen. Most engineering organizations are walking straight into it.
The Unsung Hero of Keeping Users Happy
Have you ever had one of those days when you’re cruising through a task in the zone, and suddenly - boom - your software crashes? Frustrating. Today, let’s chat about the metric that aims to minimize frustration: Time to Restore Service (or TRS for the acronym lovers among us). What’s Time to Restore Service? Imagine you’re hosting a party, and suddenly, the music stops because of a power outage. Time to Restore Service is akin to the time it takes to get that music back up and pumping. In software, it’s the time it takes to restore a service or application to its full function after an incident or failure. ...
The Bumpy Ride We All Need to Smooth Out
Today, we’re diving deep into the roller-coaster world of Change Failure Rate (CFR). By the end of our chat, you’ll see why this metric is like the unsung hero of successful software delivery. What is Change Failure Rate? In a nutshell, Change Failure Rate is the percentage of deployments that fail. It measures how often we hit a bump when introducing new changes. So, Why Should We Care? It’s great to deploy often (remember our chat about Deployment Frequency?), but not at the cost of quality. A high CFR indicates we might be moving fast but also breaking things along the way. ...
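To make the definition concrete, here is a minimal sketch of the arithmetic, assuming you already know how many deployments went out in a period and how many of them failed. The function name and the sample numbers are hypothetical, and what counts as a "failed" deployment is something each team has to define for itself.

```python
def change_failure_rate(failed_deployments: int, total_deployments: int) -> float:
    """Return the percentage of deployments that failed."""
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    if not 0 <= failed_deployments <= total_deployments:
        raise ValueError("failed_deployments must be between 0 and total_deployments")
    return 100.0 * failed_deployments / total_deployments


# Hypothetical numbers: 3 failed deployments out of 40 in the period.
print(change_failure_rate(3, 40))  # 7.5
```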
The Heartbeat of Agile Teams and Happy Customers
Today, let’s chat about deployment frequency, diving deep into why deploying regularly to production can make our teams more agile and our customers endlessly happy. First Off, What Exactly is Deployment Frequency? In the simplest terms, Deployment Frequency is how often we push our code into production. It could be multiple times a day, once a day, weekly, or even monthly. It’s like the heartbeat of our software delivery, indicating the pace at which we’re delivering value to users. ...
The Clock That Fine-Tunes Your Delivery Pipeline
If you’ve been around the software delivery block, you’ve probably heard the saying, “Time is money.” But in our world, it’s not just money; it’s about delivering value, staying agile, and being ahead of the curve. Today, I want to chat about Lead Time for Changes. What is Lead Time for Changes? Lead Time for Changes tracks the time from when a developer commits a change (like a new feature, a bug fix, or a performance tweak) to when it’s successfully deployed into production. It’s the duration between that initial “Aha! Let’s do this” moment and the “It’s live!” celebration. ...
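As a rough illustration of the commit-to-production definition, the sketch below subtracts one timestamp from the other. It assumes you can get both timestamps from your own tooling; the function name and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta


def lead_time_for_change(committed_at: datetime, deployed_at: datetime) -> timedelta:
    """Return the elapsed time between a commit and its production deployment."""
    if deployed_at < committed_at:
        raise ValueError("deployed_at must not be earlier than committed_at")
    return deployed_at - committed_at


# Hypothetical change: committed at 09:00, live in production at 15:30 the same day.
commit_time = datetime(2023, 8, 1, 9, 0)
deploy_time = datetime(2023, 8, 1, 15, 30)
print(lead_time_for_change(commit_time, deploy_time))  # 6:30:00
```

In practice you would likely look at the median across many changes rather than a single one, since one slow or fast change says little about the pipeline.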
Speed vs. Stability: Striking the Right Balance
Remember when you wanted to have your cake and eat it too? In software delivery, that cake is the delicate balance between speed (deploying features swiftly) and stability (ensuring the system runs smoothly without glitches). Ah, the eternal tug-of-war between moving fast and not breaking things. The Great Dilemma: Speed or Stability? Before we dive in, let’s first understand the core of the problem. The Need for Speed: In our high-paced digital era, businesses want to push out features, fixes, and updates at the speed of light. It’s all about meeting customer demands, staying ahead of competitors, and adapting to market changes quickly. Plus, there’s an undeniable thrill in deploying rapidly, right? ...