How Thermal Camera Surveys Protect Critical Data Centre Operations
Data centres and server rooms provide a managed and protected environment for server operations. Within these environments are complex infrastructure systems that must be inspected and maintained regularly, but how do you spot problems before they affect operational resilience? Regular maintenance and consumable replacement are one answer, but even these can miss potential issues. A more complete answer is a thermal camera survey.
Thermal Camera Surveys
A thermal camera, also known as an infrared (IR) camera, captures what cannot be seen with the naked eye, i.e. an image of the infrared radiation emitted by the different infrastructure components within a building. Images can be taken at the building incomer, substation transformer and LV switchboard, and all the way along the critical power path to uninterruptible power supplies (UPS), batteries (VRLA and lithium), power distribution units (PDUs), electrical sub-distribution and wiring. HVAC (heating, ventilation and air conditioning) systems can also be imaged, including chillers and cooling units, as well as server racks and containment arrangements.
For some organisations a thermal camera survey may be mandatory, e.g. as part of an annual insurance review or to support a certification status such as the Uptime Institute’s Tier-rating system. For others, adding a thermal camera survey to a preventative maintenance visit can provide more peace of mind, especially where older systems are deployed that may be approaching the end of their design or useful working life.
It is important to follow a set and documented procedure when carrying out a thermal camera survey. The survey itself can be a standalone service or one coupled to another such as a preventative maintenance visit, data centre audit or risk assessment project.
Critical Infrastructure ‘Hot-Spots’
In the data centre world the term ‘hot-spots’ is often used to refer to high temperature areas within a server rack. These can be caused by several factors, including poor air flow management and the arrangement of servers and UPS systems within the rack.
In the electrical world the term is used in the same way, but a ‘hot-spot’ may not necessarily result from poor equipment layout or air flow. ‘Hot-spots’ in a valve regulated lead acid (VRLA) battery or a set of AC or DC capacitors can indicate ageing or poor manufacture. The heat arises from areas of raised internal resistance which, if not dealt with, could become a fire risk. Within electrical switchgear, high temperatures can indicate load imbalances, underrated devices and electrical harmonics. Again, these are issues that can lead to fire risks and system breakdowns if not tackled.
Thermal Image Records
Thermal imagers, such as a Fluke camera, measure surface temperatures and can store two-dimensional images of an object for comparative purposes. Captured images can then be used to identify temperature anomalies, i.e. areas that are either hotter or colder than those around them or than expected.
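The idea of flagging areas that are hotter or colder than their surroundings can be sketched in a few lines of Python. This is only an illustration: it assumes the imager's radiometric data has already been exported as a grid of surface temperatures in °C, and the function name find_anomalies, the sample frame and the 10 °C threshold are hypothetical choices, not values taken from any camera vendor or standard.

```python
from statistics import median

def find_anomalies(grid, threshold_c=10.0):
    """Flag pixels that deviate from the frame median by more than threshold_c.

    grid is a list of rows of surface temperatures in degrees C.
    threshold_c is an illustrative assumption, not a recommended limit.
    Returns the baseline (median) and a list of (row, col, temp) tuples.
    """
    baseline = median(t for row in grid for t in row)
    anomalies = []
    for r, row in enumerate(grid):
        for c, temp in enumerate(row):
            if abs(temp - baseline) > threshold_c:
                anomalies.append((r, c, temp))
    return baseline, anomalies

# Hypothetical 3 x 4 frame: one hot pixel and one cold pixel.
frame = [
    [31.0, 31.5, 32.0, 31.2],
    [31.4, 58.3, 32.1, 31.0],
    [30.9, 31.8, 31.6, 18.5],
]

baseline, anomalies = find_anomalies(frame)
for row, col, temp in anomalies:
    kind = "hotter" if temp > baseline else "colder"
    print(f"pixel ({row}, {col}) at {temp} C is {kind} than its surroundings")
```

Using the median as the baseline keeps a single extreme pixel from shifting the reference, which is why it is preferred here over the mean. In practice the qualified engineer interprets the image; a check like this only helps to prioritise where to look.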
As well as identifying ‘hot-spots’, the images can be stored digitally in a Cloud service and/or submitted with a visit report. The benefit of retaining the images is that they provide a thermal audit record of changes in temperature over the life of an asset or component within the building’s infrastructure. Changes and anomalies can identify a need for investigation, maintenance, or a system upgrade or swap-out.
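A thermal audit record of this kind can be as simple as a dated list of readings per asset. The sketch below assumes each survey stores the maximum surface temperature seen on an asset and that the surveys were taken under comparable load and ambient conditions. The asset names, dates, temperatures and the 5 °C review limit are all invented for illustration.

```python
from datetime import date

def temperature_rise(readings):
    """Rise in degrees C between the earliest and the latest survey.

    readings is a list of (survey_date, max_surface_temp_c) tuples.
    """
    ordered = sorted(readings)  # sorts by date first
    return ordered[-1][1] - ordered[0][1]

def assets_needing_review(history, max_rise_c=5.0):
    """Names of assets whose temperature has risen by more than max_rise_c.

    max_rise_c is an illustrative assumption, not a recommended limit.
    """
    return sorted(
        asset
        for asset, readings in history.items()
        if len(readings) >= 2 and temperature_rise(readings) > max_rise_c
    )

# Hypothetical audit record covering three annual surveys.
history = {
    "UPS-A battery string 1": [
        (date(2022, 3, 1), 27.0),
        (date(2023, 3, 1), 28.5),
        (date(2024, 3, 1), 34.0),
    ],
    "LV switchboard breaker Q4": [
        (date(2022, 3, 1), 41.0),
        (date(2023, 3, 1), 41.5),
        (date(2024, 3, 1), 42.0),
    ],
}

print(assets_needing_review(history))
```

Here the battery string has risen by 7 °C across the three surveys and is flagged, while the breaker has risen by only 1 °C and is not.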
A Data Centre Thermal Survey Checklist
Any survey must be carried out by a suitably qualified engineer and to a set survey procedure. For any data centre or server room, the survey must be comprehensive in order to ensure that no critical infrastructure component that could prove to be a single point of failure is missed.
Most surveys start from the incoming point to the building and then follow the critical power and cooling route into the server room or data hall. Timing is important as the highest temperatures will be captured during peak operational and workload times, i.e. there is little point carrying out a thermal survey during off-peak or maintenance periods unless there are suspect and aged systems, such as old transformer-based UPS and battery sets, in operation.
- Substation Transformers: the transformer may be outside the building and the property of the local electricity distribution network operator (DNO), or be on the plant asset register of the data centre. It is important to monitor changes in temperature and temperature differences (Delta-Ts) annually with respect to windings and lug connections.
- HV/LV Switchboards: LV switchboards, like substation transformers, can have a design life measured in decades but the switchgear will require some maintenance. Active harmonic filters, for example, will have capacitors that require replacement around every 7-8 years. Temperature rises can indicate ageing components that require swap-out.
- Electrical Wiring and Sub-distribution Panels: from the LV switchgear, power is provided on sub-circuits via sub-distribution panels. These will have circuit breakers and connecting cables that must be sized and rated correctly for the calculated voltages and currents. Higher than normal temperatures can indicate overheating and underrating, both of which can undermine discrimination and fault paths in the event of a downstream short-circuit.
- Backup Power, AMF Panels and Static Transfer Switches (STS): most facilities will have some form of back-up power and may have a static transfer switch arrangement (A and B supplies). For accurate temperature measurement the back-up generators will need to be ‘imaged’ during both their standby and power-on operations, as will the static transfer switches.
- UPS Systems and Batteries: the facility may have a large centralised uninterruptible power supply or a decentralised power protection plan. Each UPS and its battery set (lead acid or lithium-ion) must be surveyed under load conditions.
- Energy Storage Systems: there is an increasing trend for larger operators to store power locally (from renewable power sources) or to generate revenue with demand side response (DSR) programmes. An energy storage system is similar to a UPS system and will have a lithium-ion type battery. Lithium batteries have more complex battery management systems than lead acid but will still require thermal camera inspection at least annually to help identify potential issues.
- Power Distribution Units: within the server racks, PDUs provide the final point of connection of the server and IT loads to the critical power paths. PDUs can experience thermal overloads that could lower their reliability when operated within server rack ‘hot-spots’, or when overloaded but still within their thermal trip settings. Poor connections and faulty wiring can also be exposed by a thermal survey.
- HVAC Systems: the cooling system should be surveyed with the same comprehensive approach as the critical power path, including each sub-component and the external chillers and heat exchangers, to expose any potential problems and failure points.
- Server Racks and Containment: thermal imaging can provide a quick and accurate assessment of the efficiency of cooling within server racks and containment systems. What must be considered here are the air intake and exhaust areas in relation to the power densities, and therefore the heat generated by the servers themselves. HPC and blade type servers in highly dense deployments generate significant amounts of heat. The survey should help to identify whether the rack air flow or containment arrangement is operating as expected in terms of preventing the cold and hot-aisle air flows from mixing and potentially weakening the overall cooling design. Coupled with a measurement of the air flow, the survey can help to map out the overall air flow in and around these areas.
- Raised Access Floors and Ceiling Voids: floor and ceiling voids may hide a plethora of issues, which can be masked if the void is used for cooling and air flow. The survey should capture images where possible to identify any thermal issues that could indicate cable damage, poor connections or simply poor layout of the areas.
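The checklist above lends itself to a simple completeness check, so that no stage of the critical power and cooling route is left without readings. The following sketch assumes readings are logged per stage as (component label, temperature in °C) pairs. The stage names mirror the checklist, while the sample readings and the helper names missing_stages and delta_t are hypothetical. Delta-T is taken here as the spread between the hottest and coolest of a set of like components under similar load, which is one common way of using the term.

```python
# Survey route in the order of the checklist (labels are illustrative).
SURVEY_ROUTE = [
    "Substation transformers",
    "HV/LV switchboards",
    "Electrical wiring and sub-distribution panels",
    "Backup power, AMF panels and STS",
    "UPS systems and batteries",
    "Energy storage systems",
    "Power distribution units",
    "HVAC systems",
    "Server racks and containment",
    "Raised access floors and ceiling voids",
]

def missing_stages(readings_by_stage):
    """Checklist stages with no recorded readings, in route order."""
    return [s for s in SURVEY_ROUTE if not readings_by_stage.get(s)]

def delta_t(readings):
    """Spread in degrees C between the hottest and coolest like components.

    readings is a list of (component_label, surface_temp_c) tuples.
    """
    temps = [temp for _, temp in readings]
    return max(temps) - min(temps)

# Hypothetical, partially completed survey.
survey = {
    "Substation transformers": [
        ("L1 lug", 48.0),
        ("L2 lug", 49.0),
        ("L3 lug", 63.5),
    ],
    "UPS systems and batteries": [("Block 1", 26.0), ("Block 2", 26.5)],
}

print("Still to survey:", missing_stages(survey))
print("Transformer lug Delta-T:", delta_t(survey["Substation transformers"]), "C")
```

What counts as an acceptable Delta-T depends on the standard or site procedure being followed, so no pass/fail limit is hard-coded here.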
Air flow design and thermal management are becoming increasingly complex within data centre and server room environments. Air flow and temperature issues can arise from changes in the design concept as new technologies are deployed, as well as from ageing components within the electrical infrastructure. Thermal camera surveys are becoming more widely accepted, either as separate thermal audits or as additions to preventative maintenance and fault-finding visits. Whilst the cameras are relatively low-cost devices, their use and application require formal training to ensure the survey is comprehensive and does not miss the single point of failure that could fail catastrophically and interrupt data centre operations.