A statement from the company said the bug allowed ‘problematic’ content in an update to pass validation, causing the ‘blue screen of death’ on millions of Windows devices.
Crowdstrike has shared an update on what went wrong last Friday (19 July) when approximately 8.5m Windows devices were affected by a global IT outage.
In a post-incident review, the cybersecurity company at the centre of the outage said the crash happened due to a bug in its system, which allowed “problematic content data” to pass validation.
“Based on the testing performed before the initial deployment…trust in the checks performed in the content validator, and previous successful IPC template instance deployments, these instances were deployed into production,” the report read.
“When received by the sensor and loaded into the content interpreter, problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception. This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash.”
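For readers unfamiliar with the term, an out-of-bounds read happens when code asks for data at a position that does not exist in the structure it is reading. The review does not publish the sensor's code, so the following is only a minimal Python sketch of the general defence: check the index before reading and reject the content rather than letting the error escape. The names `ContentError`, `read_field` and `interpret` are hypothetical and are not taken from Crowdstrike's software.

```python
class ContentError(Exception):
    """Raised when content refers to a field that is not present."""


def read_field(fields, index):
    # Explicit bounds check. In Python a negative index would silently
    # wrap around, so both ends of the range are tested.
    if not 0 <= index < len(fields):
        raise ContentError(f"index {index} outside 0..{len(fields) - 1}")
    return fields[index]


def interpret(fields, wanted_indices):
    """Return the requested fields, or None if the content is invalid."""
    try:
        return [read_field(fields, i) for i in wanted_indices]
    except ContentError:
        # Handle the failure gracefully: reject the content, keep running.
        return None
```

In this sketch, a request for a field the content does not supply returns `None`, so the caller can discard the update instead of crashing.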
While the crash hit millions of Windows devices around the world, it caused particular disruptions to airports, banks and healthcare services.
The outage was quickly linked to a flawed cybersecurity update from Crowdstrike. By the afternoon, the company had issued a fix and assured users that it was not a cyberattack.
CEO and president George Kurtz apologised for the outage and noted the “gravity and impact of the situation”.
In its latest update, Crowdstrike also laid out plans to prevent a similar issue, including additional validation checks and broader testing, such as local developer testing and content update and rollback testing.
It also stated that it will implement a “staggered deployment strategy”, in which updates are deployed gradually and monitored closely to guide a more phased roll-out.
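The review does not detail how the staggered deployment will work, but the general idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Crowdstrike's process: the function `staged_rollout`, its parameters and the stage sizes are all hypothetical, and the `deploy` and `is_healthy` callbacks stand in for whatever mechanism pushes an update and checks a device afterwards.

```python
import math


def staged_rollout(hosts, fractions, deploy, is_healthy, max_failure_rate=0.0):
    """Deploy to growing fractions of hosts, halting if a stage looks unhealthy.

    Returns (deployed_hosts, completed).
    """
    deployed = []
    for fraction in fractions:
        # Number of hosts that should have the update once this stage ends.
        target = min(len(hosts), math.ceil(len(hosts) * fraction))
        batch = hosts[len(deployed):target]
        if not batch:
            continue
        for host in batch:
            deploy(host)
        deployed.extend(batch)
        # Monitor the new batch before widening the roll-out.
        failures = sum(1 for host in batch if not is_healthy(host))
        if failures / len(batch) > max_failure_rate:
            return deployed, False  # stop here; remaining hosts are untouched
    return deployed, True
```

With stages of 10pc, 50pc and 100pc, for example, a faulty update that crashes the first batch stops the roll-out before the remaining devices receive it.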
“In addition to this preliminary post-incident review, Crowdstrike is committed to publicly releasing the full root cause analysis once the investigation is complete,” the company said.
While IT systems are still recovering from last week’s outage, another issue that has arisen is an increase in phishing campaigns from cybercriminals capitalising on the disruption.
Cause for concern
A day after the outage, Microsoft said in a statement that the incident “demonstrates the interconnected nature” of the tech ecosystem and served as a reminder of the importance of “safe deployment and disaster recovery using the mechanisms that exist”.
However, IT experts have raised further concerns about the increased likelihood of such events when behemoths such as Microsoft and Crowdstrike are connected to so many devices that are, in turn, connected to so much critical infrastructure.
During the incident last week, ESET Ireland highlighted the danger of low diversity when it comes to the use of large-scale IT infrastructure.
“This applies to critical systems like operating systems, cybersecurity products and other globally deployed applications. Where diversity is low, a single technical incident, not to mention a security issue, can lead to global-scale outages with subsequent knock-on effects.”
David Ferbrache, managing director of cybersecurity consultancy Beyond Blue, said there are also risks around rapid automated updates in live production environments.
“Organisations [need] to have the ability to control how those updates are applied and to balance the risk of a deferred update (potentially leaving security issues open but allowing additional testing) against the risk of immediate application,” he said. “This is a fine balance and sophisticated customers need to be able to strike that balance.”
The seriousness of the outage also means that the Crowdstrike CEO has been called before the US Congress to testify.
The letter sent to Kurtz, first reported by The Washington Post, said: “While we appreciate Crowdstrike’s response and coordination with stakeholders, we cannot ignore the magnitude of this incident, which some have claimed is the largest IT outage in history.”
Ferbrache added that as insurance cases and legal debates around liability are mounting, the incident showed how much damage can be done when something goes wrong in such an interconnected digital world.
“Governments will be keen to work with the security industry to understand how we can avoid such situations happening again in the future,” he said.