Global technological disruption showed we are just one mistake away from chaos

Michael Nagle/Bloomberg/Getty Images

Friday’s global CrowdStrike service outage was terrible. It could have been much worse.

A version of this story appeared in CNN Business’ Nightcap newsletter.


New York (CNN)

In case recent events — an assassination attempt, a new Republican vice-presidential nominee, the sitting president contracting Covid before abandoning his re-election bid — haven’t left you anxious enough about the fragility of the global order, let’s not forget that a cybersecurity firm you’ve probably never heard of made a huge mistake that demonstrated how the internet could, without warning, simply grind to a halt.

While you may not have heard of CrowdStrike, you’re unlikely to forget it anytime soon. With a single mistake in a routine software update, the company triggered what was likely the largest computer failure in history, leading to the kind of technological meltdown its products are designed to prevent.

Although CrowdStrike said the faulty update had been rolled back, the problems it caused can’t be solved with the simple “power-cycle” fix most of us are used to. As my colleague Brian Fung reported, the bug that caused Windows computers to crash to the Blue Screen of Death can be fixed. But in many cases, the fix requires painstaking work by a human.

Now might be a good time to buy your IT staff some nice coffee and bagels, because each and every affected device (for some organizations, we’re talking thousands) will likely need an administrator to assess it, reboot it into safe mode and manually delete the offending file.

“It can’t be automated,” said Kevin Beaumont, a security researcher and former Microsoft threat analyst, in a post on X. “So this will be incredibly painful for CrowdStrike customers.”

And even if your business had nothing to do with CrowdStrike, the outage could have ruined your day.

Consider a coffee shop that uses third-party online reservation services, hires third parties to ship its orders, and accepts credit and debit cards through its point of sale, which is connected to payment processing systems. You didn’t have to be a CrowdStrike customer to be hurt by the company’s mistake, and that’s what made Friday’s service outage so frustrating.

We’ve had terrifying outages before, and we’ll surely have them again. But the scale of the CrowdStrike outage once again highlights how interconnected the world has become through a network that almost none of us understand and that largely regulates itself.

“There are organizations that we depend on so much that we don’t even realize how dependent we are until they stop functioning,” said Stuart Madnick, professor of information technology at MIT’s Sloan School of Management.

Microsoft estimated that the CrowdStrike service outage affected about 8.5 million Windows devices. Airlines canceled 5,000 flights worldwide on Friday, while delays persisted through the weekend and Monday. Hospitals and government services were limited, and 911 communications stopped working in some areas.

It would be easy to blame CrowdStrike for its sloppy software update, airlines for failing to create robust backup protocols, or even Microsoft for dominating the personal computer market, but IT experts told me there are broader, systemic problems at play.

The centralized nature of cybersecurity companies means we now have “a few major points of failure,” said Anil Khurana, executive director of the Baratta Center for Global Business at Georgetown’s McDonough School of Business. “That in itself is not a bad thing, because the proliferation actually makes diagnosis even more difficult.”

But companies need “a better model for operational redundancy and backups,” Khurana said. “Our technology platforms have a mix of legacy systems coupled with modern systems, meaning the weakest link determines the overall performance of the system. I call it a ‘house of cards’ model.”

There are safeguards in place today, but regulators around the world have largely sat on the sidelines when it comes to cybersecurity risk management. IT systems are truly critical infrastructure, Khurana said, suggesting they “should go through the same kind of rigor, testing and oversight that we see at companies like Boeing or JPMorgan.”

I asked Madnick if the world should expect more mass outages.

“This is bad enough,” he said. “Could it get any worse? The answer is yes, it could.”

As costly and time-consuming as it is to manually reboot millions of devices, Friday’s outage was ultimately an isolated error by a company that acted quickly to fix it.

A bad actor looking to cause serious damage could use software to “cause computers or other equipment to explode, catch fire, or burn,” Madnick said, “in which case it doesn’t just reboot, it destroys itself.”

Okay, there’s one nightmare scenario that would make us all want to go live in a cave. But before you start stocking up on canned goods, Madnick has another way of looking at our current situation.

“These technologies give us a lot of benefits that are really worth it 99% of the time,” he said. The most important thing is to be prepared for the 1% of the time when things go wrong.