The catastrophic fire that destroyed an OVH data centre in Strasbourg, France, is a timely (if unwelcome) reminder of a basic truth about digital operations that is all too easily forgotten in this hi-tech age.
Like everything else in this world, technology is subject to the laws of nature.
OVH is Europe’s largest hosting provider, offering a range of VPS, dedicated server and assorted web services. Of its four data centres in Strasbourg, the fire that raged from March 10-11 destroyed one completely and damaged at least two others.
The outage affected a number of clients. Most critically, it triggered a complete and irrecoverable loss of player data for Rust, the video game made by Facepunch Studios.
We increasingly tend to think of data as existing in an abstract digital realm we call ‘the Cloud’. But as Rust’s developer and its players have found out to their great cost, ‘the Cloud’ actually runs on very real, physical servers. And when fire, flood or any other type of disaster strikes, those servers are vulnerable.
Inevitably, attention turns to what lessons can be learnt from the OVH incident. As growth in global data centre traffic, driven mainly by cloud adoption, continues to accelerate, the world is becoming more and more dependent on data centre resilience and system uptime. It’s hardly an exaggeration to say that physical data centre vulnerabilities are a threat to the modern world.
It should be pointed out that catastrophic fires at data centres are rare, and on the whole data centres are designed and built to very high standards of safety and security. But inevitably there are exceptions and there are accidents, as we have seen with OVH.
It would be wrong not to look for lessons in an incident like this, not least because the more data centre traffic grows, the more serious and far-reaching the consequences of such incidents will be. Imagine if something similar happened to a hyperscale data centre run by Google or Amazon.
Design, Geo-Redundancy and Critical Infrastructure
In its analysis of the causes of the fire, the Uptime Institute raises several possibilities for how it started and how it ended up being so damaging. It points to early reports of a maintenance fault with a UPS system on the day of the fire and notes that, although this remains pure speculation in relation to the OVH setup, it is not uncommon to find UPS systems sited near battery cabinets. It’s easy to see how a fire that started in a UPS and ignited nearby batteries could quickly spread out of control.
Similarly, the Uptime Institute analysis mentions how the “auto-ventilation” convection-cooling towers used across several OVH data centres, for all that they are hailed as eco-friendly and energy-efficient, also have the potential to act like a chimney, sucking in air that feeds the fire.
All of this is to say that designs are never perfect. We always think we have the best, most secure and resilient system possible until something goes wrong and a flaw is revealed. Physically resilient data centre design is an ongoing learning process.
Another issue arises from the complete loss of data experienced by Rust. We know that traditional data centre hosting arrangements, where everything is stored and backed up on servers at one site, are more vulnerable than geo-redundant cloud hosting where data and resources are shared and replicated across multiple locations.
But the Uptime Institute again points out that complete geo-redundancy is expensive. Many cloud services continue to offer limited geo-redundancy only, with data replicated across adjacent data centres rather than over a significant geographic distance. That in itself creates vulnerabilities.
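To make the difference between adjacent and genuinely geo-redundant replication concrete, here is a minimal Python sketch that measures how far apart a set of replica sites are. Everything in it is an illustrative assumption rather than anything taken from OVH or the Uptime Institute: the site names are hypothetical, the coordinates are approximate city-level figures, and the 100 km threshold is an arbitrary stand-in for "a significant geographic distance".

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def max_separation_km(sites):
    """Largest distance between any two replica sites (0.0 if fewer than two)."""
    pairs = combinations(sites.values(), 2)
    return max((haversine_km(p, q) for p, q in pairs), default=0.0)


def is_geo_redundant(sites, min_km=100.0):
    """True if at least two replicas are min_km or more apart.

    min_km is an assumed threshold chosen for illustration only.
    """
    return max_separation_km(sites) >= min_km


# Hypothetical replica layouts; coordinates are approximate.
adjacent_sites = {
    "site-a": (48.5734, 7.7521),  # roughly Strasbourg
    "site-b": (48.5740, 7.7530),  # a neighbouring building on the same campus
}
spread_sites = {
    "site-a": (48.5734, 7.7521),  # roughly Strasbourg
    "site-c": (50.6942, 3.1746),  # roughly Roubaix, a few hundred km away
}

print(is_geo_redundant(adjacent_sites))  # replicas on one campus
print(is_geo_redundant(spread_sites))    # replicas in different regions
```

The first layout fails the check because both copies sit within a few hundred metres of each other, which is exactly the situation in which a single fire can take out the data and its backup together.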
Finally, the OVH fire will perhaps serve to strengthen the arguments of those who believe data centres should be treated as critical national infrastructure, rather than simply as commercial assets.
Dr Alexander Taylor from the University of Cambridge is a leading voice in this area, arguing that the fabric of our society now depends on digital infrastructure as much as it does on energy, water supplies, transport and communication, all of which are designated as critical infrastructure.
The point of adding data centres to this list of essential national assets is that it would lift responsibility for resilience and security partly out of the hands of operators and give a stake to national authorities. This would mean that data centres and digital infrastructure in general would be prioritised for security planning at a national and international level.