CrowdStrike thoughts and questions
The recent failed update raises questions for customers of SaaS products, for CrowdStrike, for platform providers and for the industry as a whole.
On Friday July 19, 2024, CrowdStrike pushed out a faulty update to customers of its Falcon endpoint security product running on Microsoft Windows. This caused widespread, prolonged outages because the fix, although trivial, required manual intervention on each affected machine and could not be automated.
Thanks go out to the sysadmins who got most things functional in short order. There will be many lessons to learn as a result of this incident. Software is hard and I imagine particularly so when pressed to be responsive to evolving security threats. As we await the root cause analysis, here are my questions and thoughts.
For Customers
- Costs of SaaS – SaaS solutions like CrowdStrike provide benefits, but at the cost of less control. How comfortable are you with third parties pushing changes to your systems without your knowledge and outside of your control?
- Change and Configuration Management – Most organisations implement formal change and configuration management processes for their systems. SaaS providers usually provide mechanisms for customers to have some level of control over when changes are applied to their environment, but not always and not for all types of changes...
- Risk Assessment – This has been called a "black swan" event and I am sure that most organisations would have assessed the risk as High Impact, Very Low Likelihood – if they had assessed it at all. That said, events like this do happen. A very similar event happened in 2010 with McAfee Antivirus.
For CrowdStrike
- Deployment – Why was the change not rolled out progressively instead of to all users at once? This would have limited the "blast radius" and is common best practice.
- Quality Assurance – How did this change escape testing? CrowdStrike have stated that this change was not actually a code change in the core driver, but instead related to a configuration file ("content file" in their language). Are changes to content files tested and released differently from core driver changes?
- Resiliency of driver – Should it be possible for a "bad" content file to cause the driver to crash? Is this scenario tested for? I am confident that most customers would not expect a faulty configuration file to be able to crash their systems – particularly if these files can be pushed by CrowdStrike outside of the standard software change process.
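To make the Deployment and Resiliency questions above concrete, here is a minimal Python sketch of the two ideas: a staged rollout that halts when an early stage looks unhealthy, and a loader that falls back to the last known good content when a new file is malformed. It is an illustration only and says nothing about how CrowdStrike's pipeline or driver actually work. The names `staged_rollout` and `load_content`, the stage sizes, the failure threshold and the JSON format with a `rules` list are all assumptions made for the example.

```python
import json
from typing import Callable, Sequence, Tuple


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    stages: Sequence[float] = (0.01, 0.1, 1.0),  # hypothetical stage sizes
    max_failure_rate: float = 0.05,              # hypothetical threshold
) -> Tuple[bool, int]:
    """Deploy to a growing fraction of hosts, stopping at the first bad stage.

    Returns (completed, number_of_hosts_updated).
    """
    done = 0
    for fraction in stages:
        # Each stage covers hosts up to the cumulative fraction, at least one new host.
        target = min(len(hosts), max(done + 1, int(len(hosts) * fraction)))
        batch = hosts[done:target]
        if not batch:
            continue
        for host in batch:
            deploy(host)
        failures = sum(1 for host in batch if not healthy(host))
        done = target
        if failures / len(batch) > max_failure_rate:
            return False, done  # halt: the blast radius is limited to `done` hosts
    return True, done


def load_content(raw: bytes, last_known_good: dict) -> dict:
    """Parse a content file, keeping the previous content if the new one is invalid."""
    try:
        data = json.loads(raw)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return last_known_good
    # Minimal structural check for the assumed format: an object with a list of rules.
    if not isinstance(data, dict) or not isinstance(data.get("rules"), list):
        return last_known_good
    return data
```

With these assumed defaults, a completely broken update pushed to 1,000 hosts would be halted after the first stage of 10 hosts, and a host that receives an unparseable file would keep running on its previous content rather than failing.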
For Microsoft
- Although not responsible for this incident, Microsoft have been stepping up to support their customers.
- Resiliency – Can the Windows boot process or driver (including config) update process be made more resilient? Do security tools like CrowdStrike require the low-level access which makes these kinds of errors possible?
- Platform control – Microsoft could exert greater control over what third parties can and cannot do on their platform, such as restricting low-level access. However, it is this openness that has enabled a third-party ecosystem of innovative hardware and software on Windows. In contrast, Apple operates a tightly controlled platform which promotes improved security and stability. However, they have been coming under increasing regulatory pressure as a result of these platform restrictions.
Similarly, Microsoft suggests they are restricted from locking down the platform when it comes to security software:
A Microsoft spokesman said it cannot legally wall off its operating system in the same way Apple does because of an understanding it reached with the European Commission following a complaint. In 2009, Microsoft agreed it would give makers of security software the same level of access to Windows that Microsoft gets.
For Everyone
It is worth thinking about how consolidation and lack of diversity in Internet infrastructure, such as security products, can make us all more vulnerable to human errors and malicious threats.