Data centres and server rooms provide a managed and protected environment for server operations. Within these environments are complex infrastructure systems that must be inspected and maintained regularly, but how do you spot problems before they affect operational resilience? Regular maintenance and consumable replacement are one answer, but even these can miss potential issues. A further answer is a thermal camera survey.
A thermal camera, also known as an infrared (IR) camera, captures what cannot be seen with the naked eye, i.e. an image of the infrared radiation emitted by the different infrastructure components within a building. Images can be taken at the building incomer, substation transformer and LV switchboard, and all the way along the critical power path to uninterruptible power supplies, batteries (VRLA and lithium), power distribution units, and electrical sub-distribution and wiring. HVAC (heating, ventilation and air conditioning) systems can also be imaged, including chillers and cooling units, as well as server racks and containment arrangements.
For some organisations a thermal camera survey may be mandatory, e.g. as part of an annual insurance review or to maintain a certification status such as the Uptime Institute’s Tier-rating system. For others, adding a thermal camera survey to a preventative maintenance visit can provide more peace of mind, especially where older systems are deployed that may be approaching the end of their design or useful life.
It is important to follow a set and documented procedure when carrying out a thermal camera survey. The survey itself can be a standalone service or one coupled to another such as a preventative maintenance visit, data centre audit or risk assessment project.
In the data centre world the term ‘hot-spots’ is often used to refer to high temperature areas within a server rack. These can be caused by several factors, including poor air flow management and the arrangement of servers and UPS systems within the rack.
In the electrical world the term is used in the same way, but a hot-spot does not necessarily result from poor equipment layout or air flow. ‘Hot-spots’ in a valve regulated lead acid (VRLA) battery or a set of AC or DC capacitors can indicate ageing or poor manufacture. The heat arises from areas of increased internal resistance that, if not dealt with, could lead to a fire risk. Within electrical switchgear, high temperatures can indicate load imbalances, underrated devices and electrical harmonics. Again, these are issues that can lead to fire risks and system breakdowns if not tackled.
Thermal imagers, such as a Fluke camera, measure surface temperatures and can store two-dimensional images of an object for comparative purposes. Captured images can then be used to identify temperature anomalies, i.e. areas that are either hotter or colder than those around them, or than expected.
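As a rough illustration of how a captured image might be screened for such anomalies, the sketch below treats a thermal image as a grid of surface temperatures in °C and flags any cell that differs from the grid median by more than a set margin. The function name `find_anomalies`, the 10 °C margin and the sample readings are assumptions made for this example; they are not taken from any camera vendor's software or from a published standard.

```python
from statistics import median

def find_anomalies(grid, threshold_c=10.0):
    """Flag cells whose temperature differs from the grid median
    by more than threshold_c (an illustrative margin, in deg C)."""
    baseline = median(t for row in grid for t in row)
    anomalies = []
    for r, row in enumerate(grid):
        for c, temp in enumerate(row):
            delta = temp - baseline
            if abs(delta) > threshold_c:
                # Positive delta = hotter than surroundings, negative = colder
                anomalies.append((r, c, temp, delta))
    return baseline, anomalies

# Hypothetical 3 x 4 grid of readings from one switchboard panel
sample = [
    [31.0, 32.0, 31.5, 30.5],
    [31.5, 58.0, 33.0, 31.0],
    [30.5, 32.5, 31.0, 18.0],
]

baseline, anomalies = find_anomalies(sample)
for r, c, temp, delta in anomalies:
    kind = "hot" if delta > 0 else "cold"
    print(f"{kind} spot at row {r}, col {c}: {temp} C ({delta:+.2f} vs median)")
```

A median is used as the baseline here so that one extreme reading does not drag the reference temperature towards itself. In practice the acceptable margin would depend on the component, its load and the ambient conditions.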
As well as being used to identify ‘hot-spots’, the images can be stored digitally in a Cloud service and/or submitted with a visit report. The benefit of retaining the images is that they provide a thermal audit record of changes in temperature over the life of an asset or component within the building’s infrastructure. Changes and anomalies can identify a need for investigation, maintenance, or a system upgrade or swap-out.
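The idea of a thermal audit record can be sketched in a few lines. The example below assumes each asset has a list of dated peak-temperature readings from successive surveys and reports how far the latest reading has risen above the first. The asset names, the readings and the 5 °C review limit are hypothetical, and a meaningful comparison would also assume that each survey was taken under similar load and ambient conditions.

```python
from datetime import date

def temperature_rise(history, limit_c=5.0):
    """Compare the latest survey reading with the first one.

    history: list of (survey_date, peak_temp_c) tuples, in any order.
    Returns (rise_c, needs_review), where needs_review is True when
    the rise exceeds limit_c (an illustrative limit, in deg C).
    """
    ordered = sorted(history)          # sort by survey date
    rise = ordered[-1][1] - ordered[0][1]
    return rise, rise > limit_c

# Hypothetical audit record built up over three annual surveys
audit_record = {
    "UPS-A battery string 1": [
        (date(2021, 3, 1), 27.0),
        (date(2022, 3, 1), 28.5),
        (date(2023, 3, 1), 35.0),
    ],
    "LV switchboard breaker Q4": [
        (date(2021, 3, 1), 40.0),
        (date(2022, 3, 1), 41.0),
        (date(2023, 3, 1), 41.5),
    ],
}

for asset, history in audit_record.items():
    rise, needs_review = temperature_rise(history)
    status = "investigate" if needs_review else "ok"
    print(f"{asset}: rise of {rise:.1f} C since first survey ({status})")
```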
Any survey must be carried out by a suitably qualified engineer and to a set survey procedure. For any data centre or server room, the survey must be comprehensive in order to ensure that no critical infrastructure component that could prove to be a single point of failure is missed.
Most surveys start from the incoming point to the building and then follow the critical power and cooling route into the server room or data hall. Timing is important, as the highest temperatures will be captured during peak operational and workload times, i.e. there is little point carrying out a thermal survey during off-peak or maintenance periods unless there are suspect and aged systems, such as old transformer-based UPS and battery sets, in operation.
Air flow design and thermal management are becoming increasingly complex within data centre and server room environments. Air flow and temperature issues can arise from changes in the design concept as new technologies are deployed, as well as from ageing components within the electrical infrastructure. Thermal camera surveys are becoming more widely accepted, either as separate thermal audits or as additions to preventative maintenance and fault-finding visits. Whilst the cameras are relatively low-cost devices, their use and application require formal training to ensure the survey is comprehensive and does not miss the single point of failure that could fail catastrophically and interrupt data centre operations.