As the dust settled on the internet’s latest major disruption, Cloudflare traced the cause of the outage to an inflated ‘feature’ configuration file.
Yesterday (19 November), a major Cloudflare outage caused widespread disruption across popular websites and services on the internet.
Scores of unconnected websites and platforms, all linked by their use of Cloudflare in their back-end operations, were hit by extended periods of downtime and loading issues, with many users seeing the error message “Please unblock challenges.cloudflare.com to proceed” when attempting access.
The disruption – which affected the sites and services of well-known companies such as X, OpenAI, Spotify, Shopify, Etsy, DownDetector and Bet365, as well as video game behemoth League of Legends – began just before noon yesterday and was completely resolved by just after 5pm, according to Cloudflare.
In the aftermath of the outage, Cloudflare co-founder and CEO Matthew Prince published a blogpost where he explained that the disruption was not caused by a cyberattack but rather an error in the company’s database systems.
File fracas
According to Prince, the outage was triggered by a change to one of the company’s database systems’ permissions, which caused the database to output multiple entries into a “feature file” used by Cloudflare’s bot management system.
The bot management system includes a machine learning (ML) model that it uses to generate bot scores for every request traversing the company’s network. Cloudflare’s customers use bot scores to control which bots are allowed or not allowed to access their sites. The ML model takes the aforementioned feature file as an input; the file is made up of individual traits the model uses to predict whether a request is automated or not.
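To make the mechanics more concrete, here is a minimal Python sketch of how a feature file of weighted traits might feed a bot-scoring model. The file format, trait names and scoring function are assumptions for illustration only, not Cloudflare’s actual implementation.

```python
# Illustrative sketch only: the feature-file format, trait names and scoring
# logic are assumptions, not Cloudflare's real bot management code.
import json


def load_feature_file(path: str) -> dict:
    """Load per-feature weights that the bot-score model consumes."""
    with open(path) as f:
        # e.g. {"header_order": 0.4, "tls_fingerprint": 0.7, ...}
        return json.load(f)


def bot_score(request_traits: dict, features: dict) -> float:
    """Combine observed request traits with feature weights into a 0-1 bot score."""
    score = 0.0
    for name, weight in features.items():
        score += weight * request_traits.get(name, 0.0)
    return min(max(score, 0.0), 1.0)  # higher scores suggest automated traffic


# A customer-facing rule could then allow or block requests on a threshold:
# if bot_score(traits, load_feature_file("features.json")) > 0.8: block()
```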
“A change in our underlying ClickHouse query behaviour that generates this file caused it to have a large number of duplicate ‘feature’ rows,” Prince explained.
The duplicate rows caused the feature file to double in size, and this inflated file was then propagated to all of the machines that make up Cloudflare’s network, where it caused the bot management module to throw an error.
“The software running on these machines to route traffic across our network reads this feature file to keep our bot management system up to date with ever-changing threats,” said Prince. “The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.”
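The failure mode described here amounts to a hard-coded size limit being exceeded by an unexpectedly large configuration file. The sketch below shows that pattern in Python; the limit value, file layout and error handling are hypothetical, chosen only to illustrate how duplicate rows can turn a routine refresh into a hard failure.

```python
# Illustrative sketch only: the limit, file layout and failure behaviour are
# assumptions, not the actual logic in Cloudflare's traffic-routing software.
MAX_FEATURE_ROWS = 200  # hypothetical preallocated limit in the proxy


def refresh_features(path: str) -> list[str]:
    """Reload the feature file, failing hard if it exceeds the preallocated limit."""
    with open(path) as f:
        rows = [line.strip() for line in f if line.strip()]
    if len(rows) > MAX_FEATURE_ROWS:
        # A query change that emits duplicate rows roughly doubles the row
        # count, pushing the file past the limit and aborting the refresh.
        raise RuntimeError(
            f"feature file has {len(rows)} rows, limit is {MAX_FEATURE_ROWS}"
        )
    return rows
```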
After determining that the “symptoms” of the outage were not caused by a hyper-scale DDoS attack (as was the company’s initial fear), Cloudflare identified the issue, stopped the spread of the inflated feature file and replaced it with an earlier version.
“We are sorry for the impact to our customers and to the internet in general,” Prince said. “Given Cloudflare’s importance in the internet ecosystem, any outage of any of our systems is unacceptable.
“That there was a period of time where our network was not able to route traffic is deeply painful to every member of our team.”
‘Concentration risk’
With the outage resolved, Prince outlined a number of safeguards that Cloudflare is now working on to protect its systems should a similar error occur in future.
These include hardening ingestion of Cloudflare-generated configuration files “in the same way we would for user-generated input”, enabling more global kill switches for features, eliminating the ability for “core dumps or other error reports” to overwhelm system resources, and reviewing failure modes for error conditions across all core proxy modules.
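The first of those safeguards, treating internally generated configuration files with the same suspicion as user-generated input, can be sketched as a validation step with a fall-back to the last known-good version. The rules and fall-back mechanism below are assumptions about what such hardening might look like, not Cloudflare’s stated design.

```python
# Illustrative sketch only: validation rules and fall-back behaviour are
# assumptions about "hardening ingestion" of a generated config file.
MAX_FEATURE_ROWS = 200  # hypothetical limit, as in the earlier sketch


def load_validated_features(path: str, last_known_good: list[str]) -> list[str]:
    """Treat a generated feature file like untrusted input: validate it,
    and keep serving the previous good version if validation fails."""
    try:
        with open(path) as f:
            rows = [line.strip() for line in f if line.strip()]
        if not rows or len(rows) > MAX_FEATURE_ROWS:
            raise ValueError(f"unexpected row count: {len(rows)}")
        if len(set(rows)) != len(rows):
            raise ValueError("duplicate feature rows detected")
        return rows
    except (OSError, ValueError):
        # Rather than failing the proxy, fall back to the last good config.
        return last_known_good
```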
Forrester principal analyst Brent Ellis told SiliconRepublic.com that the Cloudflare outage, along with the recent Amazon Web Services and Microsoft Azure outages, shows the impact of “concentration risk”.
Ellis predicted that yesterday’s outage could have caused direct and indirect losses of $250m to $300m due to the cost of downtime and the “downstream effects” of services such as Shopify and Etsy, which host online stores for “tens to hundreds of thousands of businesses”.
“Being resilient from failures like this means learning what type of outages that service provider might be vulnerable to and then architecting failover measures,” he said. “Sadly, resilience isn’t free and businesses will need to decide if they want to make the investment in alternative service providers and failover solutions.
“Some industries, like financial services, must already address these concerns as part of regulation. Given the high profile of cloud-related outages recently, I expect operational resilience regulation might spread outside the financial sector.”