CrowdStrike reveals why 8.5 million Windows PCs crashed last week
Cybersecurity firm CrowdStrike has traced the cause of a major system crash affecting 8.5 million Windows machines to a bug in its test software. The company admitted in a post-incident review that the problematic update was not adequately validated before being released last Friday. The update, intended to "gather telemetry on possible novel threat techniques," led to widespread system failures when deployed to CrowdStrike's Falcon software, which is used globally to protect against malware and security breaches.
CrowdStrike commits to enhanced testing procedures
In response to the incident, CrowdStrike has pledged to improve its testing procedures and error handling. The company also plans to implement staggered deployment for future updates as a preventive measure against similar incidents. The problematic update was part of a 40KB Rapid Response Content (RRC) file, one of two types of updates issued by the company. These updates modify how the Falcon sensor behaves in detecting malware on Windows systems.
Content Validator bug allowed problematic update release
Despite having a cloud-based system to validate content before release, CrowdStrike's problematic update was approved due to a bug in the Content Validator. The company typically conducts both automated and manual testing on Sensor Content and Template Types. However, it appears that less rigorous testing was performed on the RRC that caused the crash. A previous successful deployment in March had instilled "trust in the checks performed in the Content Validator," leading to assumptions about the safety of this update.
Faulty update triggers Windows system crash
The problematic RRC was loaded into the sensor's Content Interpreter, triggering an out-of-bounds memory exception. CrowdStrike explained that "This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD)." The company's assumption about the safety of the update rollout led to this unforeseen system failure. This incident underscores the importance of rigorous testing and validation procedures for all updates, regardless of their perceived safety or past success rates.
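To illustrate the kind of failure described above, here is a minimal sketch of bounds checking and graceful error handling in an interpreter that reads fields from an update. It is not CrowdStrike's code: the names `ContentError`, `read_field`, and `apply_update` are hypothetical, and Python stands in for the sensor's actual implementation, which the article does not describe.

```python
class ContentError(Exception):
    """Raised when an update does not match what the interpreter expects."""


def read_field(fields, index):
    # Check the index explicitly before reading. Python would silently
    # wrap a negative index, so it is rejected here as well.
    if not 0 <= index < len(fields):
        raise ContentError(
            f"field index {index} out of range for {len(fields)} fields"
        )
    return fields[index]


def apply_update(fields, wanted_indexes):
    # Hypothetical entry point: returns the requested fields, or None
    # if the update is malformed and should be rejected.
    try:
        return [read_field(fields, i) for i in wanted_indexes]
    except ContentError:
        return None


if __name__ == "__main__":
    print(apply_update(["a", "b", "c"], [0, 2]))
    print(apply_update(["a", "b", "c"], [3]))
```

In this sketch a malformed update is rejected and the caller can keep running with its previous content, rather than the error propagating and taking the whole system down.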
Enhanced testing to prevent future crashes
To prevent similar incidents, CrowdStrike plans to enhance its RRC testing using local developer testing, content update and rollback testing, stress testing, and fault injection. The company will also conduct stability testing and content interface testing on RRC. In addition to these measures, CrowdStrike is improving its cloud-based Content Validator to better scrutinize RRC releases. The company stated, "A new check is in process to guard against this type of problematic content from being deployed in the future."
CrowdStrike to implement staggered deployment for future updates
CrowdStrike is also planning to enhance error handling in the Content Interpreter. The company will ensure a staggered deployment of Rapid Response Content, so that updates are gradually deployed to larger portions of its install base, instead of being pushed to all systems at once. This strategy, recently recommended by security experts, aims to minimize the risk of widespread crashes caused by problematic updates.
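A staggered deployment of this kind can be sketched roughly as follows. This is an illustrative assumption about how such a rollout might work, not a description of CrowdStrike's system: the stage percentages, the failure threshold, and the names `rollout_bucket`, `should_receive`, and `staged_rollout` are all hypothetical.

```python
import hashlib

# Hypothetical rollout stages, as a percentage of the install base.
STAGES = [1, 10, 50, 100]


def rollout_bucket(host_id):
    # Map each host to a stable bucket from 0 to 99 by hashing its ID,
    # so the same host always falls in the same stage.
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def should_receive(host_id, percent):
    return rollout_bucket(host_id) < percent


def staged_rollout(hosts, is_healthy, stages=STAGES, max_failure_rate=0.01):
    # Push the update to a growing share of hosts, and stop as soon as
    # one stage shows too many unhealthy hosts.
    updated = set()
    for percent in stages:
        batch = [
            h for h in hosts
            if h not in updated and should_receive(h, percent)
        ]
        updated.update(batch)
        failures = sum(1 for h in batch if not is_healthy(h))
        if batch and failures / len(batch) > max_failure_rate:
            return {"halted_at": percent, "updated": len(updated)}
    return {"halted_at": None, "updated": len(updated)}


if __name__ == "__main__":
    hosts = [f"host-{i}" for i in range(1000)]
    print(staged_rollout(hosts, lambda h: True))
```

The point of the design is that a faulty update is stopped after it has reached only the hosts in the early stages, rather than every system at once.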