Lessons from CrowdStrike update failure: Make sure your incident recovery plan is ready – Howard Solomon Reports

Last month’s global failure of a CrowdStrike update shows the importance of network visibility and being prepared for the collapse of critical IT systems, says an expert.

“It was definitely a significant event, and when any of those events happen it really is about enacting your major incident response procedures,” Denis Villeneuve, cybersecurity and resilience practice leader at Kyndryl Canada, said in an interview.

“These things happen,” he said, referring to the July 19 distribution of a flawed CrowdStrike Falcon content configuration update that caused an estimated 8.5 million Windows servers and desktops of Falcon customers to crash.

In a report, CrowdStrike blamed the crashes on a failure to validate the number of fields in an update template.

Every IT department can be exposed to third party risk, Villeneuve said. “It’s a question of preparing for these types of incidents. The last big one we had was Log4J, and it went down a similar path in terms of preparedness and being able to respond.”

Kyndryl is an IT services provider. Villeneuve said globally hundreds of its customers used CrowdStrike, and over 43,000 of their servers were impacted. Some had to be rebuilt, while others only needed to have Falcon quick fixes installed. He said 85 per cent of Kyndryl customers had fully recovered their systems within 24 hours. The remainder were up within 72 hours.

The interview came as Microsoft announced it will hold a Windows security summit on September 10 to discuss how to improve IT systems in the wake of the incident. “Our discussions will focus on improving security and safe deployment practices, designing systems for resiliency and working together as a thriving community of partners to best serve customers now, and in the future,” said Aidan Marcuss, Microsoft vice-president for Windows and Devices.

“We look forward to bringing our perspective to the discussions with Microsoft and industry and government stakeholders on the need for a more resilient ecosystem,” a CrowdStrike spokesperson told Reuters.

Villeneuve emphasized that one lesson IT leaders should learn from the CrowdStrike incident is having a resilient IT infrastructure so it can withstand such failures.

“It’s important to continuously improve our defence and recovery capabilities. It [the incident] demonstrates the importance of preparedness and real-time visibility. Being able to have end-to-end visibility of your entire IT and status to be able to react to the most mission critical areas of your business is very important. The analytics you can set up ahead of time will allow you to respond more quickly.

“I’ve met [over the years] with organizations big and small – mostly big – and a lot of it is around taking the time to look at our disaster recovery capabilities and resiliency of our businesses.”

He also noted that governments are increasingly stepping in to force organizations to act. For example, he said, the European Union’s Digital Operational Resilience Act (DORA) forces financial institutions in the EU to do digital operational resilience testing of their information and communications systems.

Canada’s proposed cybersecurity legislation (C-26, the Critical Cyber Systems Protection Act), which initially covers four critical infrastructure sectors (banking, telecommunications, transportation and interprovincial pipelines), includes a part mandating the mitigation of supply chain and third-party risks.

Unfortunately, “we see disaster recovery plans that haven’t been touched in quite a few years,” Villeneuve said.

That, he said, is because IT departments are financially constrained. “You can only do so much with the budget you are allocated.” As a result, organizations haven’t been focusing on resiliency over the last few years. When a crisis like Log4J or CrowdStrike pops up “it sort of gets board attention and there’s additional funding that goes into improving the DR plan and making sure companies are meeting their fiduciary obligations.”

Disaster recovery plans must be up to date to cover infrastructure change like digital transformation, he said – and the plan has to be regularly tested.

One other lesson from the CrowdStrike incident for IT leaders: Where possible, Villeneuve said, spread the installation of application updates. For example, 20 per cent of computers get an update at a time over a limited period. That allows IT administrators to see if the update is a problem without putting the entire organization at risk.
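
The staggered approach Villeneuve describes can be sketched in a few lines of Python. This is a minimal illustration only, not how Kyndryl or CrowdStrike deploy updates: the function names `plan_waves` and `staged_rollout`, and the `apply_update` and `is_healthy` callbacks, are hypothetical placeholders for whatever deployment and monitoring tooling an organization actually uses.

```python
from typing import Callable, List, Sequence


def plan_waves(hosts: Sequence[str], percent: int = 20) -> List[List[str]]:
    """Split hosts into consecutive waves, each about `percent` of the fleet."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    if not hosts:
        return []
    # Integer ceiling division, so every wave holds at least one host.
    size = max(1, (len(hosts) * percent + 99) // 100)
    return [list(hosts[i:i + size]) for i in range(0, len(hosts), size)]


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], None],  # hypothetical: pushes the update to one host
    is_healthy: Callable[[str], bool],    # hypothetical: post-update health check
    percent: int = 20,
) -> List[str]:
    """Update one wave at a time and stop after the first wave that
    contains an unhealthy host. Returns the hosts that were updated."""
    updated: List[str] = []
    for wave in plan_waves(hosts, percent):
        for host in wave:
            apply_update(host)
            updated.append(host)
        # Check the whole wave before touching the next group of machines.
        if not all(is_healthy(host) for host in wave):
            break
    return updated


if __name__ == "__main__":
    fleet = [f"pc{i}" for i in range(10)]
    touched = staged_rollout(
        fleet,
        apply_update=lambda host: None,          # stand-in for a real deployment call
        is_healthy=lambda host: host != "pc3",   # simulate one machine failing
    )
    print(f"Updated {len(touched)} of {len(fleet)} machines before halting")
```

In a real environment the health check would also wait a set period after each wave, so that a problem has time to surface before the next group of machines is updated.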

Howard Solomon
https://www.itworldcanada.com
Currently a freelance writer, I'm the former editor of ITWorldCanada.com and Computing Canada. An IT journalist since 1997, I've written for ITBusiness.ca and Computer Dealer News. Before that I was a staff reporter at the Calgary Herald and the Brampton (Ont.) Daily Times.
