It seems that the industry in general never learns the lesson about backup -- backup -- backup. There is perhaps a lesson here about backup regimes, and about the wisdom of keeping data on the PC rather than treating PCs more as thin clients, with local apps that access remote data.
It should be possible to reinstate a past snapshot image of a disk in mere seconds. Plus whatever it takes to stop CrowdStrike running and invoking the 'Brick My PC' update.
The cost of this debacle is reportedly estimated at over $5 billion, and I'm not sure that includes incidental costs to the public affected by it.
That aside, there are two other lessons for large-scale, mission-critical systems.
- Never apply system updates to live systems without testing them offline. If this had been done, the CrowdStrike damage would not have struck the crowd.
- Another single point of failure has been identified. Much as it is nice to restrict systems to one architecture, at this level the software system should always be implemented on two different hardware platforms, with two different system software regimes. That way, at the cost of extra complexity and higher support costs, an overall system can be made such that it does not rely on any one software component. If the Win servers go down, the Linux ones may not be affected. The overall system response and throughput will be degraded, but it will not be struck out for the crowd's count.
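As a rough illustration of the two points above, here is a minimal Python sketch. Everything in it is hypothetical: the function names, the host names and the `apply_update` / `is_healthy` callbacks are placeholders I have made up, not anything CrowdStrike or any real deployment tool provides. The first function only touches live hosts once every offline test host has taken the update and still passes a health check; the second tries independent backends (say, a Windows one and a Linux one) in turn so that one failing platform does not take the whole service down.

```python
from typing import Any, Callable, Iterable, Sequence, Tuple


def staged_rollout(
    apply_update: Callable[[str], Any],
    is_healthy: Callable[[str], bool],
    test_hosts: Iterable[str],
    live_hosts: Iterable[str],
) -> list[str]:
    """Apply an update to offline test hosts first, then to live hosts.

    Returns the hosts that received the update. If any test host is
    unhealthy after updating, the rollout stops and no live host is touched.
    """
    updated: list[str] = []
    for host in test_hosts:
        apply_update(host)
        updated.append(host)
        if not is_healthy(host):
            # Bad update caught on a test machine: abort before going live.
            return updated
    for host in live_hosts:
        apply_update(host)
        updated.append(host)
    return updated


def call_with_fallback(
    backends: Sequence[Tuple[str, Callable[[str], str]]],
    request: str,
) -> Tuple[str, str]:
    """Try each independent backend in order; return (name, reply) of the first that works."""
    errors = []
    for name, handler in backends:
        try:
            return name, handler(request)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

In practice the health check would have to include at least a reboot of the test machine, since that is where this particular update did its damage. The fallback function also glosses over the hard part, which is keeping data consistent across the two platforms.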
Statistics: Posted by davidcoton — Sat Jul 27, 2024 2:08 pm