Lessons from CrowdStrike update failure: Make sure your incident recovery plan is ready – Howard Solomon Reports

Last month’s global failure of a CrowdStrike update shows the importance of network visibility and being prepared for the collapse of critical IT systems, says an expert.

“It was definitely a significant event, and when any of those events happen it really is about enacting your major incident response procedures,” Denis Villeneuve, cybersecurity and resilience practice leader at Kyndryl Canada, said in an interview.

“These things happen,” he said, referring to the July 19 distribution of a flawed CrowdStrike Falcon content configuration update that caused an estimated 8.5 million Windows servers and desktops of Falcon customers to crash.

In a report, CrowdStrike blamed the crashes on a failure to validate the number of fields in an update template.

Every IT department can be exposed to third-party risk, Villeneuve said. “It’s a question of preparing for these types of incidents. The last big one we had was Log4J, and it went down a similar path in terms of preparedness and being able to respond.”

Kyndryl is an IT services provider. Villeneuve said globally hundreds of its customers used CrowdStrike, and over 43,000 of their servers were impacted. Some had to be rebuilt, while others only needed to have Falcon quick fixes installed. He said 85 per cent of Kyndryl customers had fully recovered their systems within 24 hours. The remainder were up within 72 hours.

The interview came as Microsoft announced it will hold a Windows security summit on September 10th to discuss how to improve IT systems in the wake of the incident. “Our discussions will focus on improving security and safe deployment practices, designing systems for resiliency and working together as a thriving community of partners to best serve customers now, and in the future,” said Aidan Marcuss, Microsoft vice-president for Windows and Devices.

“We look forward to bringing our perspective to the discussions with Microsoft and industry and government stakeholders on the need for a more resilient ecosystem,” a CrowdStrike spokesperson told Reuters.

Villeneuve emphasized that one lesson IT leaders should learn from the CrowdStrike incident is the need for a resilient IT infrastructure that can withstand such failures.

“It’s important to continuously improve our defence and recovery capabilities. It [the incident] demonstrates the importance of preparedness and real-time visibility. Being able to have end-to-end visibility of your entire IT and status to be able to react to the most mission critical areas of your business is very important. The analytics you can set up ahead of time will allow you to respond more quickly.

“I’ve met [over the years] with organizations big and small – mostly big – and a lot of it is around taking the time to look at our disaster recovery capabilities and resiliency of our businesses.”

He also noted that governments are increasingly stepping in to force organizations to act. For example, he said, the European Union’s Digital Operational Resilience Act (DORA) forces financial institutions in the EU to do digital operational resilience testing of their information and communications systems.

Canada’s proposed cybersecurity legislation (C-26, the Critical Cyber Systems Protection Act), which initially covers four critical infrastructure sectors (banking, telecommunications, transportation and interprovincial pipelines), includes a part mandating the mitigation of supply chain and third-party risks.

Unfortunately, “we see disaster recovery plans that haven’t been touched in quite a few years,” Villeneuve said.

That, he said, is because IT departments are financially constrained. “You can only do so much with the budget you are allocated.” That has meant over the last few years that organizations haven’t been focusing on resiliency. When a crisis like Log4J and CrowdStrike pops up, “it sort of gets board attention and there’s additional funding that goes into improving the DR plan and making sure companies are meeting their fiduciary obligations.”

Disaster recovery plans must be up to date to cover infrastructure change like digital transformation, he said – and the plan has to be regularly tested.

One other lesson from the CrowdStrike incident for IT leaders: Where possible, Villeneuve said, spread the installation of application updates. For example, 20 per cent of computers get an update at a time over a limited period. That allows IT administrators to see if the update is a problem without putting the entire organization at risk.
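The staggered approach Villeneuve describes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how CrowdStrike, Kyndryl or any particular patch tool works: the function names `plan_waves` and `staged_rollout` are hypothetical, and `apply_update` and `is_healthy` stand in for whatever deployment and monitoring hooks an organization already has.

```python
from typing import Callable, List, Sequence, Tuple


def plan_waves(hosts: Sequence[str], percent: int = 20) -> List[List[str]]:
    """Split hosts into consecutive waves of roughly `percent` of the fleet each."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    if not hosts:
        return []
    # Integer ceiling division so every wave has at least one host.
    wave_size = max(1, (len(hosts) * percent + 99) // 100)
    return [list(hosts[i:i + wave_size]) for i in range(0, len(hosts), wave_size)]


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], None],  # hypothetical hook: push the update to one host
    is_healthy: Callable[[str], bool],    # hypothetical hook: post-update health check
    percent: int = 20,
) -> Tuple[List[str], bool]:
    """Update one wave at a time and halt once a wave contains an unhealthy host.

    Returns the list of hosts that received the update and whether the
    rollout was halted early.
    """
    updated: List[str] = []
    for wave in plan_waves(hosts, percent):
        for host in wave:
            apply_update(host)
            updated.append(host)
        # Check the whole wave before touching the next one.
        if not all(is_healthy(h) for h in wave):
            return updated, True
    return updated, False


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(10)]
    done, halted = staged_rollout(
        fleet,
        apply_update=lambda h: None,         # no-op stand-in for a real deployment call
        is_healthy=lambda h: h != "host-3",  # pretend host-3 crashes after updating
    )
    print(done, halted)
```

In this example a fault that shows up on one machine in the second wave stops the rollout after four of ten machines have been updated, leaving the other six untouched. A real deployment would also need a soak period between waves and a rollback step, which are omitted here.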

Howard Solomon
https://www.itworldcanada.com
Currently a freelance writer, I'm the former editor of ITWorldCanada.com and Computing Canada. An IT journalist since 1997, I've written for ITBusiness.ca and Computer Dealer News. Before that I was a staff reporter at the Calgary Herald and the Brampton (Ont.) Daily Times.
