CrowdStrike admits a rookie mistake caused global IT outage
US cybersecurity firm CrowdStrike has disclosed the root cause of the faulty software update that crashed Windows computers worldwide in July: a common mistake that first-year programming students learn to avoid. The company's 12-page analysis pinpointed an undetected flaw in a Falcon sensor update as the culprit behind the incident, now known as "Blue Screen of Death (BSOD) Friday." The malfunction affected an estimated 8.5 million Windows systems globally on July 19.
Falcon software's role in the system crash
The Falcon sensor product, widely used by businesses and large organizations for ransomware, malware, and internet security protection, experienced an issue with an update. This software functions at the kernel level of Windows, closely monitoring user activity and application requests. Sigi Goode from the Australian National University explained that sensors act as "a pathway for evidence," directing Falcon on what suspicious activity to detect.
Update error and system crash
On BSOD Friday, CrowdStrike issued a Rapid Response Content update to specific Windows hosts. According to the company's report, the update's content defined 21 input fields, while the sensor code supplied only 20 values. This "count mismatch" led to an out-of-bounds memory read beyond the end of the input data array, causing a system crash.
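The count mismatch can be illustrated with a minimal Python sketch, in which a content definition that refers to 21 fields meets an input array holding only 20 values. This is not CrowdStrike's code: the names `TEMPLATE_FIELDS`, `SUPPLIED_INPUTS`, `counts_match` and `read_field` are hypothetical, and Python is memory safe, so the bad read surfaces here as an exception rather than the unchecked memory access that crashes kernel-level code.

```python
# Hypothetical illustration of a field-count mismatch; not CrowdStrike's code.
TEMPLATE_FIELDS = 21   # fields the update's content refers to
SUPPLIED_INPUTS = 20   # values actually present in the input array


def counts_match(template_field_count, inputs):
    """Return True only if the content and the input array agree on size."""
    return template_field_count == len(inputs)


def read_field(inputs, index):
    """Return inputs[index] after an explicit bounds check.

    Kernel-level code in C or C++ has no automatic check, so reading
    index 20 of a 20-element array touches memory past the end.
    """
    if not 0 <= index < len(inputs):
        raise IndexError(
            f"field {index + 1} requested but only {len(inputs)} inputs supplied"
        )
    return inputs[index]


inputs = [f"value-{i}" for i in range(1, SUPPLIED_INPUTS + 1)]

if not counts_match(TEMPLATE_FIELDS, inputs):
    print("count mismatch: refuse to load this content")
```

In this sketch either guard would be enough: comparing the two counts before the content is used, or checking every index before it is read.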
Impact and aftermath of the system crash
The crash had significant consequences because Falcon is deeply integrated with the Windows kernel. When Falcon crashed, it brought down the entire system, producing the BSOD. CrowdStrike's CEO, George Kurtz, has been summoned before the US Congress to explain what happened. Kurtz apologized for the failure this week and gave assurances that measures are being taken to prevent such incidents in the future. "We are using the lessons learned from this incident to better serve our customers," he said.
Quality assurance processes under scrutiny
The incident has sparked questions about CrowdStrike's quality assurance (QA) processes. The company admitted in its report that the "lack of a specific test for non-wildcard matching criteria in the 21st field" contributed to the system crash. Toby Murray from the University of Melbourne's School of Computing and Information Systems described it as an "incredibly basic and fundamental mismatch" that should have been detected by even basic quality review and assurance checks.
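Why a missing non-wildcard test matters can be sketched in a few lines of Python. The sketch rests on an assumption that is not spelled out above: a wildcard criterion matches anything, so the input at that position is never read. The matcher, the `WILDCARD` marker and the sample data are all hypothetical and only illustrate the kind of test the report says was missing.

```python
# Hypothetical matcher; an assumption-laden sketch, not CrowdStrike's code.
WILDCARD = "*"  # assumed marker meaning "match anything"


def matches(criteria, inputs):
    """Compare each non-wildcard criterion with the input at the same position."""
    for position, criterion in enumerate(criteria):
        if criterion == WILDCARD:
            continue  # wildcard: the input at this position is never read
        if position >= len(inputs):
            raise IndexError(f"criterion {position + 1} has no matching input")
        if inputs[position] != criterion:
            return False
    return True


inputs = ["x"] * 20                           # only 20 values supplied

wildcard_criteria = [WILDCARD] * 21           # 21st field is a wildcard
specific_criteria = [WILDCARD] * 20 + ["x"]   # 21st field must equal "x"

# The mismatch stays hidden while the 21st criterion is a wildcard.
print(matches(wildcard_criteria, inputs))

# A non-wildcard 21st criterion forces the read that exposes the mismatch.
try:
    matches(specific_criteria, inputs)
except IndexError as error:
    print("test caught the mismatch:", error)
```

A test suite that only ever used wildcard criteria in the last position would pass every time, which is how a mismatch of this kind can slip through.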