Let’s talk about CrowdStrike. The outage that rang around the world.
It was very serious, and I don’t want to take away from that. However, it was not surprising either: a chain of events created a perfect storm. Those events are well documented across the news cycle at this point.
This article at The Verge talks about the RCA in some detail. To quote the important parts of the article:
To prevent this from happening again, CrowdStrike is promising to improve its Rapid Response Content testing by using local developer testing, content update and rollback testing, alongside stress testing, fuzzing, and fault injection. CrowdStrike will also perform stability testing and content interface testing on Rapid Response Content.
…
On the driver side, CrowdStrike will “enhance existing error handling in the Content Interpreter,” which is part of the Falcon sensor. CrowdStrike will also implement a staggered deployment of Rapid Response Content, ensuring that updates are gradually deployed to larger portions of its install base instead of an immediate push to all systems. Both the driver improvements and staggered deployments have been recommended by security experts in recent days.
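To make the staggered deployment idea concrete, here is a minimal sketch in Python. It is not CrowdStrike's implementation; the function name `staged_rollout`, the `deploy` and `is_healthy` callbacks, and the stage percentages are all hypothetical choices for illustration. The point is only that each wave is checked before the next, larger wave goes out.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stages: Sequence[float] = (0.01, 0.10, 0.50, 1.0),  # hypothetical wave sizes
) -> tuple[list[str], bool]:
    """Deploy to growing fractions of hosts, stopping at the first unhealthy wave.

    Returns (hosts that received the update, whether the rollout completed).
    """
    updated: list[str] = []
    for fraction in stages:
        # Cumulative number of hosts that should be updated after this wave.
        target = max(1, int(len(hosts) * fraction))
        batch = list(hosts[len(updated):target])
        for host in batch:
            deploy(host)
            updated.append(host)
        # Check the wave before widening the blast radius.
        if not all(is_healthy(h) for h in batch):
            return updated, False
    return updated, True
```

With 100 hosts and the default stages, a bad update that breaks one host in the second wave stops after 10 hosts instead of reaching all 100. A real process would also need a rollback step and a soak time between waves, which this sketch leaves out.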
My background with CI/CD:
- I have managed three different CI/CD processes/environments throughout my career.
- I have managed CI/CD for both small (<10) and large multi-team (30+) deployments.
- Of the three, I created two from the ground up.
- 15+ years combined of being responsible for some form of the deployment process.
My opinion on the response: From my experience in CI/CD and deploy processes, these are changes that should always be considered at the beginning. Discussed and documented.
From my experience, the response sheds some light on the possible CI/CD culture behind the scenes.
Over the years, I have learned some very important lessons about CI/CD. This is not all of them, just some that come to mind after reading that article.
📌 CI/CD protects the company’s brand and reputation.
So many times, discussions regarding CI/CD have begun with “efficiency” and “developers need to move faster”. Those are conversations I try to steer away from, because you can’t have speed and efficiency without having consistency first. Focus on security, customer experience, etc. Increasing speed in a system that is unstable leads to generating more bugs than features.
Bugs, instability, security issues… all are caused by CI/CD processes not working effectively. Auditable changes, documentation, automation, clear direction, and clear decision making are all results of a proper CI/CD process.
Moving fast is great, but it should come hand in hand with thinking about “what is the worst that could happen?”
📌 QA and testing in general should be considered a temperature gauge, not a blocker or a gatekeeper.
I have been in discussions during RCAs where the focus has been on QA as the only “testing”. Even CrowdStrike seems to have realized during this incident that developers need to start writing unit tests, judging by the quote “…testing by using local developer testing.”
A healthy business and team dynamic would know whether QA is finding that 80% of the new code is failing in testing, or only 10%. Sometimes the question for a release is not whether the version passed all tests, but how comfortable the team is with the current situation.
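As a rough illustration of the "temperature gauge" idea, here is a small Python sketch. The names `QaRun` and `release_temperature`, and the 25% and 50% thresholds, are assumptions I made up for the example; every team would pick its own. The gauge reports a reading for people to discuss rather than a pass/fail verdict.

```python
from dataclasses import dataclass


@dataclass
class QaRun:
    """Counts from one QA pass over the new code (hypothetical structure)."""
    passed: int
    failed: int

    @property
    def failure_rate(self) -> float:
        total = self.passed + self.failed
        return self.failed / total if total else 0.0


def release_temperature(run: QaRun, warm: float = 0.25, hot: float = 0.50) -> str:
    """Turn a failure rate into a reading; thresholds are illustrative only."""
    rate = run.failure_rate
    if rate >= hot:
        return "hot"
    if rate >= warm:
        return "warm"
    return "cool"
```

A run where 80% of the new code fails reads as "hot", and one where 10% fails reads as "cool". Either way, the decision to ship stays with the team.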
QA tests are the testing required for the CD side of the process; unit tests by developers are the tests on the CI side. I was once in a meeting with other developers where they were talking about QA writing unit tests as well (hint: that is a big red flag for me).
📌 The quality of the CI/CD process is reflective of how well the teams work together.
With a healthy CI/CD process in place, no single person or team should feel responsible for breaking production. There will be checks and balances, documentation and double checking, reviews and multiple sign-offs. Especially for something that is critical.
There is a reason for the slash between CI and CD: they should be independent of each other. Continuous Integration and Continuous Deployment. One group of people is focused on writing code, heads down in the IDE. The other group is planning and actually pushing the code out, validating the snapshots of the “soon to be” production environment.
A healthy process is where all worst case scenarios are discussed up front. A healthy process is where it is clear who is responsible for what. A healthy process is where teams support each other in those responsibilities as much as possible.
A healthy process is where everyone can feel empowered to bring up concerns. Based on CrowdStrike’s changes, I am wondering if and for how long others have been asking for those changes internally.
📌 Proper CI/CD can never work without leadership support.
This is the reality of CI/CD. No matter which direction it is taken, it comes down to leadership. Period. Full stop, end of story.
I have been in many discussions over the years where, if left up to developers, QA, DevOps, etc., it becomes a question of who has bigger numbers or more authority. No one likes change, removal of access, or having responsibility taken away. That may need to happen if it becomes someone else’s job.
Leadership setting those expectations to begin with is critical, so the discussions can be more fruitful. From the CrowdStrike response, it sounds like leadership finally has an incentive to have a stable and consistent CI/CD process.