Last month’s global failure of a CrowdStrike update shows the importance of network visibility and being prepared for the collapse of critical IT systems, says an expert.
“It was definitely a significant event, and when any of those events happen it really is about enacting your major incident response procedures,” Denis Villeneuve, cybersecurity and resilience practice leader at Kyndryl Canada, said in an interview.
“These things happen,” he said, referring to the July 19 distribution of a flawed CrowdStrike Falcon content configuration update that caused an estimated 8.5 million Windows servers and desktops of Falcon customers to crash.
In a report, CrowdStrike blamed the crashes on a failure to validate the number of fields in an update template.
Every IT department can be exposed to third-party risk, Villeneuve said. “It’s a question of preparing for these types of incidents. The last big one we had was Log4J, and it went down a similar path in terms of preparedness and being able to respond.”
Kyndryl is an IT services provider. Villeneuve said globally hundreds of its customers used CrowdStrike, and over 43,000 of their servers were impacted. Some had to be rebuilt, while others only needed to have Falcon quick fixes installed. He said 85 per cent of Kyndryl customers had fully recovered their systems within 24 hours. The remainder were up within 72 hours.
The interview came as Microsoft announced it will hold a Windows security summit September 10th to discuss how to improve IT systems in the wake of the incident. “Our discussions will focus on improving security and safe deployment practices, designing systems for resiliency and working together as a thriving community of partners to best serve customers now, and in the future,” said Aidan Marcuss, Microsoft vice-president for Windows and Devices.
“We look forward to bringing our perspective to the discussions with Microsoft and industry and government stakeholders on the need for a more resilient ecosystem,” a CrowdStrike spokesperson told Reuters.
Villeneuve emphasized that one lesson IT leaders should learn from the CrowdStrike incident is the need for a resilient IT infrastructure that can withstand such failures.
“It’s important to continuously improve our defence and recovery capabilities. It [the incident] demonstrates the importance of preparedness and real-time visibility. Being able to have end-to-end visibility of your entire IT and status to be able to react to the most mission critical areas of your business is very important. The analytics you can set up ahead of time will allow you to respond more quickly.
“I’ve met [over the years] with organizations big and small, mostly big, and a lot of it is around taking the time to look at our disaster recovery capabilities and resiliency of our businesses.”
He also noted that governments are increasingly stepping in to force organizations to act. For example, he said, the European Union’s Digital Operational Resilience Act (DORA) forces financial institutions in the EU to do digital operational resilience testing of their information and communications systems.
Canada’s proposed cybersecurity legislation (C-26, the Critical Cyber Systems Protection Act), which initially covers four critical infrastructure sectors (banking, telecommunications, transportation and interprovincial pipelines), includes a part mandating the mitigation of supply chain and third-party risks.
Unfortunately, “we see disaster recovery plans that haven’t been touched in quite a few years,” Villeneuve said.
That, he said, is because IT departments are financially constrained. “You can only do so much with the budget you are allocated.” That has meant over the last few years that organizations haven’t been focusing on resiliency. When a crisis like Log4J or CrowdStrike pops up, “it sort of gets board attention and there’s additional funding that goes into improving the DR plan and making sure companies are meeting their fiduciary obligations.”
Disaster recovery plans must be kept up to date to cover infrastructure changes such as digital transformation, he said, and the plan has to be regularly tested.
One other lesson from the CrowdStrike incident for IT leaders: Where possible, Villeneuve said, spread the installation of application updates. For example, 20 per cent of computers get an update at a time over a limited period. That allows IT administrators to see if the update is a problem without putting the entire organization at risk.
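For administrators who script their own deployments, the staggered approach Villeneuve describes can be sketched roughly as follows. This is only an illustration under stated assumptions, not a description of how Kyndryl, CrowdStrike or any particular tool rolls out updates: the function names `plan_waves` and `staged_rollout` are hypothetical, and `apply_update` and `is_healthy` are placeholder callbacks standing in for whatever deployment and monitoring tooling an organization actually uses.

```python
import math


def plan_waves(hosts, fraction=0.2):
    """Split hosts into consecutive waves, each holding about `fraction` of the fleet."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be greater than 0 and at most 1")
    # At least one host per wave, so small fleets still make progress.
    size = max(1, math.ceil(len(hosts) * fraction))
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


def staged_rollout(hosts, apply_update, is_healthy, fraction=0.2):
    """Update one wave at a time and halt after the first wave with an unhealthy host.

    apply_update and is_healthy are hypothetical callbacks supplied by the caller.
    Returns (hosts_updated_so_far, completed_flag).
    """
    updated = []
    for wave in plan_waves(hosts, fraction):
        for host in wave:
            apply_update(host)
        updated.extend(wave)
        # Check the wave that was just updated before touching any more machines.
        if not all(is_healthy(host) for host in wave):
            return updated, False
    return updated, True
```

With ten machines and the default 20 per cent fraction, the sketch produces five waves of two. If a machine in the second wave reports unhealthy after the update, the rollout stops with only four machines touched and the other six left on the previous version.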