What can the LGPS learn from a global IT outage?
02 Sep 2024
19 July is my wedding anniversary. This year, following a meal at one of our favourite restaurants, I had to go to the cash machine next door to settle our bill after the card machine crashed. It was a minor inconvenience in what had otherwise been an enjoyable evening.
I hadn’t appreciated the wider significance of this until the next morning, when I realised that many people would remember 19 July 2024 for an entirely different and less celebratory reason.
Catastrophes rarely arise out of nowhere, and what happened in July was no exception.
What triggered the July outage?
On 28 February, the cybersecurity company CrowdStrike released a new version of their software, “Sensor 7.11”. This introduced a new mechanism to detect and defend against cyberattacks on a type of network connection used within Microsoft’s Windows operating system (called “named pipes”, if you’re interested). The software works at “driver” level, at the very core of the operating system, and installing such a driver is a complex, time-consuming and fragile process.
CrowdStrike’s new release was designed to update itself automatically in response to newly emerging threats using configuration files. This gave them the ability to update the software quickly without releasing an entirely new version: all they needed to do was deploy a new configuration file describing the new threat, and job done. No potentially time-consuming and complex installations to perform, just a single file which could be deployed automatically.
On 24 April, the CrowdStrike team released their first 4 configuration files, and everything went as expected. Threats averted! Champagne all round! However, the scene was now set for one of the most expensive IT failures in history.
On 19 July, the team deployed two further configuration files. Unbeknownst to anyone, there was a bug in the configuration file validator, and one of the two files was malformed. This caused CrowdStrike’s deeply embedded software to crash, and Windows did the only thing it can do when such a critical component fails: it crashed too, displaying the infamous “blue screen of death”.
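To make that failure mode a little more concrete, here is a deliberately simplified sketch of the general pattern (written in Python and purely illustrative – the names and structure are my own invention, not CrowdStrike’s actual code): a validator that is more lenient than the code which later consumes the file, so a malformed configuration passes its checks and only fails once it reaches the component that cannot afford to fail.

```python
# Illustrative only: a toy "threat definition" pipeline where the validator
# is more permissive than the interpreter that later consumes the file.

def validate(config: dict) -> bool:
    # The buggy validator only checks that the expected keys are present...
    return {"name", "pattern", "fields"} <= config.keys()

def interpret(config: dict) -> str:
    # ...but the interpreter also assumes every entry in "fields" is populated.
    return "|".join(field["value"] for field in config["fields"])

good = {"name": "threat-1", "pattern": "named-pipe-abuse", "fields": [{"value": "x"}]}
bad = {"name": "threat-2", "pattern": "named-pipe-abuse", "fields": [{}]}  # malformed

for cfg in (good, bad):
    if validate(cfg):          # both files pass the too-lenient check
        print(interpret(cfg))  # the malformed one raises an error here
```

In an ordinary application, that kind of unhandled error produces a stack trace and an apology; in a driver running at the core of the operating system, the equivalent failure takes the whole machine down with it.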
Because deployment was automated, and because this happened overnight for most of CrowdStrike’s 20,000 clients, within only a few hours some 8.5 million Windows devices, including bus stops, supermarket checkouts and even MRI scanners, were left unusable.
Carnage ensues
Many airlines ground to a halt. Disruption to banks and electronic payment systems caused long queues at supermarkets and petrol stations. Pharmacies couldn’t fill prescriptions. Healthcare providers were forced to cancel surgeries. I got off lightly: I only had to walk to a (luckily unaffected) cash machine.
So far, the cost of this incident to US Fortune 500 companies alone has been estimated at $5.4 billion. Delta Air Lines put its own costs at $500 million. Interestingly, though, only a fraction of these losses is expected to be recovered through insurance.
But why is this relevant to the LGPS and what lessons can we learn from this?
Lessons learned for the LGPS
The events of July demonstrated in no uncertain terms how reliant society is on digital systems. There is a clear need for organisations to be more cyber aware, more cyber resilient and to have plans in place to deal with disruption.
Working out how we, as individual organisations, should deal with future complex global outages can seem overwhelming, and much of the responsibility for resolving them lies with technology vendors. They need to recognise the consequences of operating within a wider ecosystem, where many vendors and even whole industries are highly connected and depend on one another to function. The relatively swift recovery from the incident was a testament to the efficient and effective collaboration protocols already established between hardware and software vendors.
Yet, whilst much of the resolution of these types of incident must be left in the hands of experts, there are mitigating actions that we can all consider locally within our own organisations:
- Business continuity plans – ensure these are up to date and that your team rehearses them regularly with tabletop or simulation exercises. Reviewing and updating them as circumstances change keeps your team ready to respond when an incident does occur.
- Education – cybersecurity can be a complex and daunting topic, but investing in your team’s technological acumen can increase your resilience. Much of the damage from outages stems not just from the incident itself, but from the subsequent response, or lack thereof.
- Communications – after an outage, the way an organisation responds is crucial. If the response is delayed, disorganised or inappropriate, it can exacerbate the situation. Poor communication can lead to confusion and panic among stakeholders, worsening the impact.
- Others – be aware of the vulnerability of technology monocultures, focus on the quality (not quantity!) of testing, and, perhaps most obvious with hindsight, favour phased rather than “big-bang” rollouts (there’s a short sketch of what that means after this list).
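As an illustration of that last point, here is a minimal sketch of a phased rollout (again in Python and purely illustrative – deploy_to and healthy are hypothetical stand-ins for whatever deployment and monitoring tooling an organisation actually uses):

```python
# Illustrative only: push a change to progressively larger "waves" of devices,
# pausing to check health before going any wider. deploy_to() and healthy()
# are hypothetical placeholders for real deployment and monitoring tools.
import time

WAVES = [0.01, 0.05, 0.25, 1.00]  # 1%, then 5%, then 25%, then everyone

def phased_rollout(devices, deploy_to, healthy, soak_minutes=60):
    already_done = 0
    for fraction in WAVES:
        wave = devices[already_done:int(len(devices) * fraction)]
        deploy_to(wave)                # update this wave only
        time.sleep(soak_minutes * 60)  # let the change "soak" before widening
        if not healthy(wave):          # any sign of trouble?
            raise RuntimeError("Rollout halted: problems detected in this wave")
        already_done = int(len(devices) * fraction)
```

The details will differ from one organisation to the next, but the principle is the same: no change, however routine it seems, reaches every machine at once.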
These all require some level of investment. There is often pressure to drive down the cost of both IT and systems, but as my grandmother used to say, “there is nothing you can’t make a little worse and a little cheaper”.
Ostensibly saving money by cutting back on, or ignoring, important qualitative elements of technology can prove a false economy, as many organisations found out the hard way in July.
And finally, you can read CrowdStrike’s own post-incident report here.
If you would like to discuss any of this further, please get in touch.