Some airlines, banks and government services around the world have been affected by a faulty software update to CrowdStrike’s Falcon security suite that has knocked out Windows desktop PCs.
“This is not a security incident or cyberattack,” the company said in a statement.
“The issue has been identified, isolated and a fix has been deployed. We refer customers to the support portal for the latest updates and will continue to provide complete and continuous updates on our website.”
To keep IT leaders from being duped by threat actors trying to exploit the situation, the company urged them to communicate with CrowdStrike staff only through official, known channels.
The problem is in a channel file named “C-00000291*.sys” with a timestamp of 0409 UTC.
An identically named file with a later timestamp is fine.
If hosts are still crashing and unable to stay online to receive the channel file changes, the following steps can be used to work around the issue:
Reboot the host to give it an opportunity to download the reverted channel file. If the host crashes again, then:
- Boot Windows into Safe Mode or the Windows Recovery Environment
- NOTE: Putting the host on a wired network (as opposed to WiFi) and using Safe Mode with Networking can help remediation.
- Navigate to the %WINDIR%\System32\drivers\CrowdStrike directory
- Locate the file matching “C-00000291*.sys”, and delete it.
- Boot the host normally.
- NOTE: BitLocker-encrypted hosts may require a recovery key.
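For teams that want to script the cleanup on hosts they can reach, the locate-and-delete step above could be sketched in Python roughly as follows. This is an illustration only, not CrowdStrike tooling: Python is generally not available in Safe Mode or the Windows Recovery Environment, and the helper names (`default_driver_dir`, `find_channel_files`, `remove_channel_files`) and the dry-run default are assumptions made for the example. The sketch prints each match's modification time in UTC so an operator can compare it with the 0409 UTC timestamp before deleting anything.

```python
import os
from datetime import datetime, timezone
from pathlib import Path

# File name pattern quoted in the workaround steps.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def default_driver_dir() -> Path:
    """Resolve %WINDIR%\\System32\\drivers\\CrowdStrike (fallback path is an assumption)."""
    windir = os.environ.get("WINDIR", r"C:\Windows")
    return Path(windir) / "System32" / "drivers" / "CrowdStrike"


def find_channel_files(driver_dir: Path) -> list[Path]:
    """Return every file in driver_dir whose name matches the channel file pattern."""
    return sorted(driver_dir.glob(CHANNEL_FILE_PATTERN))


def remove_channel_files(driver_dir: Path, dry_run: bool = True) -> list[Path]:
    """List matching channel files and delete them unless dry_run is True."""
    matches = find_channel_files(driver_dir)
    for path in matches:
        modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        action = "would delete" if dry_run else "deleting"
        print(f"{action}: {path} (modified {modified:%Y-%m-%d %H:%M} UTC)")
        if not dry_run:
            path.unlink()
    return matches
```

Calling `remove_channel_files(default_driver_dir())` only reports what it finds; passing `dry_run=False` performs the deletion, after which the host would be booted normally as in the steps above.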
CrowdStrike has workaround steps for IT environments in public and virtual clouds, as well as specific documentation for Azure and AWS environments and systems using BitLocker.
CrowdStrike describes the issue as “a defect found in a single content update for Windows hosts.” That has left some wondering what a “content” problem is. Joe Tidy, the BBC’s cyber correspondent, says one interpretation is it was “something as innocuous as the changing of a font or logo on the software design.”
However, because Falcon is an endpoint detection and response sensor, individual PCs have to be fixed. “The automatic fix that CrowdStrike states they have made is a change so that this problematic update is no longer delivered to endpoint devices,” notes John Hammond, principal researcher at Huntress. “Unfortunately, this does not help the machines that are already affected and stuck in a boot loop. The mitigation and recovery workaround that is suggested is unfortunately a very manual process… it needs to be done at the physical location of the computer, by hand, for every computer impacted. It will be a very long and very slow recovery process.
“Affected computers will still need to be booted into ‘Safe Mode’ to make changes and fix the problematic CrowdStrike drivers. Others in the community have suggested other options like renaming the driver folder structure, or other tricks, but all of those efforts will still need to be manual and one-at-a-time. Automatic updates or group policy deployments are not a viable solution while the devices are stuck in a ‘boot loop’ and unable to load the full Windows operating system.”
In England, Sky News reports thousands of doctors’ offices have been affected after the widely used EMIS appointment and patient record system went down.
In Canada, CBC News reports some healthcare services in Newfoundland and Labrador are affected.
Some cybersecurity vendors have been quick to comment.
“This incident is Microsoft’s fault, not CrowdStrike’s fault,” said J.J. Guy, CEO of Sevco Security. “Yes, CrowdStrike pushed a kernel-level update that causes wide-spread blue screens. Yes, that should have been caught during QA [quality assurance] and I’m sure we will get an after-action report that details why release procedures didn’t catch it. But software bugs happen. They are unavoidable – even for top-tier shops like CrowdStrike.
“This is a high-impact incident not because there was a blue screen, but because it causes repeated blue screens on reboot and [appears as of right now] to require manual, command-line intervention on each box to remediate (and even harder if BitLocker is enabled). That is the result of poor resiliency in the Microsoft Windows operating system. Any software causing repeated failures on boot should not be automatically reloaded. We’ve got to stop crucifying CrowdStrike for one bug, when it is the OS’s behavior that is causing the repeated, systemic failures.”
A spokesperson for Microsoft has been asked for comment.
Andy Ellis, operating partner at cybersecurity venture capital firm YL Ventures, said this incident is a reminder of lessons learned in 2004 when he was CISO at Akamai. A metadata update – the configuration files that specify how to handle each customer’s traffic – went out to all Akamai servers, and a bad interaction with the software caused widespread issues, including crashing (“rolling”) servers. When those servers returned to service, they reread the problematic config, and the cycle continued. “We had fast rollbacks, at least, and the incident was very quickly cleaned up. But in doing safety analysis, this was a hazard that we wanted better mitigation, so we adopted crash rejection.”
When an application received a dynamic configuration update, the update system would drop it into a location specified for not-yet-read updates (“/inbound/”). Depending on the application, either the updater would ping it to tell it a new update was available, or the application regularly polled /inbound/ looking for updates.
“The first thing the application would do is *hide* the update. It would move it from /inbound/ into /tmp/, keeping track of the new location in memory. Only then would it attempt to read and parse the update. If all went well, a few seconds later it would move the update away from /tmp/ and into its permanent location. If things didn’t go well? The application would crash, and, on restart, it would never notice the toxic update. It had auto-reverted, suffering only a single crash in the meantime.”
There is complexity, he admits, because a machine might crash for completely unrelated reasons. IT has to be able to notice and get it a new copy of the update, usually by having a second channel maintaining the list of safe updates in one form or another. “But if you’re writing dynamically updatable software, crash rejection is one of the many safety practices you need to incorporate.”
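The crash-rejection pattern Ellis describes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Akamai's implementation: the function name `apply_pending_update`, the three directory parameters and the pluggable `parse` callback are all hypothetical. The key point is the ordering: the update is moved out of the inbound directory before it is read, so a crash during parsing leaves nothing for the restarted process to pick up.

```python
import os
from pathlib import Path
from typing import Any, Callable, Optional


def apply_pending_update(
    inbound: Path,
    hidden_dir: Path,
    active_dir: Path,
    parse: Callable[[bytes], Any],
) -> Optional[Any]:
    """Apply the oldest pending update using the hide-then-parse ordering.

    Returns the parsed update, or None if nothing is waiting in inbound.
    """
    hidden_dir.mkdir(parents=True, exist_ok=True)
    active_dir.mkdir(parents=True, exist_ok=True)

    pending = sorted(p for p in inbound.iterdir() if p.is_file())
    if not pending:
        return None
    update = pending[0]

    # Step 1: hide the update BEFORE reading it, so a restarted process
    # polling inbound will never see this file again.
    hidden = hidden_dir / update.name
    os.replace(update, hidden)

    # Step 2: read and parse. If this step kills the process, the file
    # stays in hidden_dir and is not retried on restart.
    config = parse(hidden.read_bytes())

    # Step 3: parsing succeeded, so promote the update to its permanent home.
    os.replace(hidden, active_dir / update.name)
    return config
```

In this sketch a raised exception stands in for a real process crash. As the article notes, a production system would also need a second channel that tracks known-good updates, so that an update hidden by an unrelated crash can be delivered again.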
MORE COMMENT
“One way to view this is like a large-scale ransomware attack,” said Eric O’Neil, a former FBI counterintelligence agent who runs The Georgetown Group, a Washington, D.C., investigative and security consultancy. “I’ve talked to several CISOs and CSOs who are considering triggering restore-from-backup protocols instead of manually booting each computer into safe mode, finding the offending CrowdStrike file, deleting it, and rebooting into normal Windows. Companies that haven’t invested in rapid backup solutions are stuck in a catch-22.”
(This story has been updated with comments from Huntress and Eric O’Neil. It has also been corrected to make clear Windows PCs, and not servers, running Falcon were impacted.)