29/01/2021

How to Carry Out Server Room and Data Centre Risk Assessments

There are several hazards within a server room or data centre that can disrupt operations, lead to downtime, and potentially cause personal injuries. A formal risk assessment process is a way to identify the hazards and implement control and monitoring measures to mitigate the potential risks. Risk assessments should be carried out by suitably trained personnel in order to comply with health & safety requirements, and they can help to improve the overall resilience of a server room or data centre.

What is a Risk Assessment?

A risk assessment is a systematic process for evaluating the potential risks that a project, task, or work area (such as a computer room, server room or data centre) poses to people or groups of people, and for implementing control measures to reduce those risks. There are generally five steps to a risk assessment:

  1. Identify the hazards: start with a walk around to review the hazards just outside and within the server room.
  2. Decide who might be harmed and how: for each identified hazard, be clear about who might be harmed and how, as this will help in identifying control measures to minimise the risk. It is best practice to consider people in groups. In a server room these could include IT engineers, facilities engineers, sub-contractors, cleaning staff, other groups within the organisation and visitors to the site.
  3. Evaluate the risks and decide on precautions: for each identified hazard, review what controls can be put in place to remove the hazard or reduce its impact. For the groups identified, the law requires doing everything ‘reasonably practicable’ to protect them from harm. Proposed controls can be compared to best practice and can include removing the hazard altogether. If the hazard cannot be removed, the question is which control or controls can be put in place to reduce the risk.
  4. Record your findings and implement them: it is important to document findings on a risk assessment template to record what was found and implemented for review.
  5. Review your risk assessment and update if necessary: risk assessments should be reviewed at least annually, during audits, after an incident and/or when there is a substantial change to the server room environment.

Risk assessments are normally generated from a standard risk assessment template and are then added to a document register. For most organisations with either ISO9001 (quality management) or OHSAS18001 or ISO45001 (health & safety management), the document register for the integrated management system (IMS) should be used. Each issue is noted in the document register, including its date and issue number, with an archive of previous document issues available for inspection as required. A completed risk assessment document should list:

  • The hazards identified
  • The people at risk
  • Precautions/Controls required to reduce the level of risk to the lowest practical level including Control Measures and Monitoring Measures
  • A Residual risk rating with the Probability and Severity each rated from 1-5 and multiplied to provide an overall Risk Rating score (Probability x Severity = Risk Rating)

Risk Rating Table and Colour Bands

The table below shows the Probability and Severity ratings (1-5) and the resulting risk rating and colour band for a risk assessment template.

Probability        | Severity              | Risk Rating (P*S) | Level           | Colour
1 Highly unlikely  | 1 Trivial             | 1                 | No action       | Grey
2 Unlikely         | 2 Minor injury        | 2-5               | Low priority    | Green
3 Possible         | 3 Over 3-day injury   | 6-9               | Medium priority | Yellow
4 Probable         | 4 Major injury        | 10-14             | High priority   | Orange
5 Certain          | 5 Incapacity or death | >= 15             | Urgent action   | Red
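
The scoring in the table can be sketched in a few lines of Python. This is an illustrative sketch only, assuming Probability and Severity are each scored 1-5 as shown in the table; the function names `risk_rating` and `risk_band` are hypothetical and not part of any template.

```python
from typing import Tuple


def risk_rating(probability: int, severity: int) -> int:
    """Return Probability x Severity, each assumed to be scored 1-5."""
    for name, value in (("probability", probability), ("severity", severity)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return probability * severity


def risk_band(rating: int) -> Tuple[str, str]:
    """Map a risk rating to the (level, colour) band from the table."""
    if rating >= 15:
        return ("Urgent action", "Red")
    if rating >= 10:
        return ("High priority", "Orange")
    if rating >= 6:
        return ("Medium priority", "Yellow")
    if rating >= 2:
        return ("Low priority", "Green")
    return ("No action", "Grey")
```

For example, a hazard scored Possible (3) for probability and Major injury (4) for severity gives a rating of 12, which falls in the High priority (Orange) band.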

For more information on risk assessment visit the Health & Safety Executive website:
https://www.hse.gov.uk/simple-health-safety/risk/index.htm

Server Room System Availability Risk Assessments

Server rooms and data centres house the critical IT, servers and networking devices required by an organisation to run its information systems. Whatever the size of the operation, there are inherent risks related to the amount of power being drawn by the IT devices, the heat generated and the potential loss of power. System downtime can result in a loss of service(s) and revenue(s), process and manufacturing problems, service level agreement penalties and negative publicity. Dependent upon the organisation, there could also be a risk to life.

A data centre or server room availability risk assessment provides a way to identify potential hazards, their impact, control and monitoring measures and overall risk rating for each identified hazard. One of the key conclusions to identify from the process is the presence of any single points of failure.

Please download our Generic Risk Assessment Template for Server Rooms and Data centres. This is an example copy as hazards, risks and the controls required will vary from site to site. A copy is available upon request in Excel format.

Server Room Hazards Checklist

A server room availability risk assessment should cover the critical infrastructure systems, including:

  1. Power: a review of the critical power path from the building incomer to the local server room sub-distribution panel and rack PDUs, with a view to energy efficiency and the Tier-rating (Uptime Institute) of the power protection plan including uninterruptible power supplies and local standby generators, A and B supplies within the rack and N+X power supplies. For each device, a review of the connected load versus capacity to identify potential growth issues or system overloads. Certifications to inspect include electrical circuit testing, maintenance, and service records (including battery testing and replacement dates), and the age of the devices and systems in place.
  2. Cooling: a review of the mechanical systems including HVAC systems in place within the server room and building in terms of energy efficiency, capacity versus load, parallel/redundant configurations, and Tier-rating. Certifications to inspect include maintenance and service records, and the age of the devices and systems in use. Rack arrangements and fit-out within the server room, including the use of blanking panels, could also be included here, as the more efficient the cooling process, the lower the amount of energy used.
  3. Fire: a review of the local fire suppression system in place (if any) and its suitability for the equipment in use and rack layout and certificate inspections for annual maintenance service and room integrity testing. Server room or data centre fire suppression systems can be at the room or server rack level.
  4. Monitoring: what local monitoring is in place for the server room in terms of environment monitoring for rack level or room temperature, humidity, water leakage and smoke detection. From an equipment point of view, which systems are monitored and how. UPS and air conditioning systems could be installed with SNMP cards to provide local IP network access and web-based remote monitoring. Alternatively, signal contacts could be used to provide alarm signals to local building management systems.
  5. Security: access control systems, keypads, biometrics, camera terminals, building CCTV and in-room motion detection cameras, guard services and security policies.
  6. Communications: assessment of the systems in use, cable routes, number of cable feeds, fibre services, demarks, carriers and in particular diversity and redundancy.
  7. Building and location: an assessment of the local building area, the room shell, interior walls, ceiling voids, under floor plenums and fire-proof doors. Proximity to hazardous substance(s) storage or a flood plain, building plumbing and transport infrastructures should also be assessed for their potential risk(s).

The first three critical infrastructure areas require further comment in terms of hazards and risks.

Server Room Electrical Safety

The topic of electrical safety within a server room should also consider power availability and the planning for how to respond to a mains power outage or momentary interruption. Specific hazards to consider include:

  • Capacity: firstly, are all the critical IT loads required for system availability during a power outage powered from an uninterruptible power supply? A missed broadband router or network switch can lead to a loss of IT services. Secondly, can the UPS system support the load? Whilst the UPS may report that the load is within its operating limits, ideally within 80% of the UPS rating, the battery set must also be capable of supporting the load for the planned duration. Batteries should be inspected and replaced regularly. An ageing battery can report adequate capacity, but this can quickly collapse when placed under load. Blackout tests should be completed at least once a year. If there is a local standby power generator, this should be tested monthly. Intelligent PDUs powered from a UPS can also help with load and capacity planning as they can identify the power usage down to outlet socket level per PDU and server rack.
  • Availability and Resilience: UPS systems within a server room can be deployed as a centralised or decentralised system. For a centralised system, the UPS is sized to power all the critical IT loads within the computer or server room. In a decentralised installation, individual UPS systems are deployed at the server rack level. A decentralised approach can lead to more complex power management issues which can be overcome using UPS power management and control software to monitor load and UPS status alarms. A centralised UPS system can be more easily maintained and monitored. However, what is important is redundancy levels and ideally any UPS installed within a computer or server room or data centre should be installed as N+1. This allows the UPS systems to share the loads and should one require maintenance or swap out, the other should be capable of providing power outage cover to the entire critical IT system. Where servers with dual power supplies are installed, A and B power paths can be created using PDUs and static transfer switches for additional resilience. This may be as far as is reasonably practical for a smaller computer room or server room in terms of resilience. Data centres on the other hand may also have N+X in their local standby power generators and derive their local mains power supplies from two separate connections to the National Grid (Tier-III data centre rating by the Uptime Institute).
  • LV Switchboards: overloaded switchgear can lead to potential failures and fire risks within the electrical distribution system, as can harmonics. Harmonics are voltage or current distortions at multiples of the supply frequency, produced by non-linear loads including the types of power supplies used in IT devices. Thermal heat guns can be used to survey the electrical system including system components such as transformers, switchgear, UPS systems and their battery sets to identify significant areas of heat build-up as part of preventative maintenance routines.
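
The capacity and N+1 checks described above can be expressed as simple arithmetic. The following is a minimal sketch, assuming the load and all UPS ratings are given in the same unit (kW here) and using the 80% loading guideline mentioned above. The function names are hypothetical, and the sketch does not cover battery runtime, which still needs a physical blackout test.

```python
from typing import Sequence


def ups_load_ok(load_kw: float, ups_rating_kw: float,
                max_fraction: float = 0.8) -> bool:
    """True if the load is within the chosen fraction of the UPS rating."""
    return load_kw <= ups_rating_kw * max_fraction


def survives_single_failure(load_kw: float,
                            unit_ratings_kw: Sequence[float]) -> bool:
    """N+1 style check: can the remaining units carry the whole load
    when the largest single unit is out of service?"""
    if len(unit_ratings_kw) < 2:
        return False  # a single UPS is a single point of failure
    remaining_kw = sum(unit_ratings_kw) - max(unit_ratings_kw)
    return load_kw <= remaining_kw
```

With two 10 kW units sharing an 8 kW load, `survives_single_failure(8.0, [10.0, 10.0])` returns `True`. If the load grows to 12 kW it returns `False`, because one unit alone can no longer cover the entire critical load.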

Server Room Cooling System Failures and Hazards

There are several potential hazards when it comes to server room cooling:

  • Capacity: is the cooling system design sufficient to support the cooling load during even the hottest of days? It is not uncommon during heat waves for additional cooling to be required. This can also occur if the cooling system requires a power down for routine or emergency maintenance.
  • Availability and Resilience: a small comms room may only have a single air conditioner installed. For a server room a dual system is recommended with operational cycling every 7 days. Should one of the air conditioners fail or be taken out of service, the remaining unit should be able to support the complete cooling load.
  • Humidity Levels: humidity levels affect static electricity and there are two hazards to consider. If the air is too dry, static electricity can build up, leading to a potential discharge and electrical spark. If the air is too humid, moisture can build up when cooled air hits warmer metal areas including server racks. The build-up of moisture droplets can lead to a short circuit and fire risk.
  • Remote Monitoring: many computer and server rooms run unattended on a 24/7 basis. If there is no alarm reporting for the cooling system, an air conditioning system failure can lead to a rapid build-up of heat and lead to a fire. Most air conditioners can be installed with a volt-free signal contact or SNMP type interface card to allow alarms to be notified to a building management system or broadcast over an IT network. Additional environment monitoring devices can be installed to monitor for temperature, humidity and airflow and report alarms to an email or SMS text alert distribution list.
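
The humidity and temperature alarms described above amount to comparing sensor readings against site limits. The sketch below is illustrative only. The threshold values are placeholders chosen for the example and are not recommendations from this article; each site should set its own limits. The `Thresholds` class and `check_reading` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Thresholds:
    # Placeholder limits for illustration only; set per site.
    max_temp_c: float = 27.0
    min_humidity_pct: float = 40.0
    max_humidity_pct: float = 60.0


def check_reading(temp_c: float, humidity_pct: float,
                  limits: Thresholds = Thresholds()) -> List[str]:
    """Return the list of alarm messages raised by one sensor reading."""
    alarms = []
    if temp_c > limits.max_temp_c:
        alarms.append("High temperature: possible cooling failure")
    if humidity_pct < limits.min_humidity_pct:
        alarms.append("Low humidity: static build-up risk")
    elif humidity_pct > limits.max_humidity_pct:
        alarms.append("High humidity: condensation risk")
    return alarms
```

In practice the returned messages would be passed to whatever alerting route is in place, such as a building management system or an email or SMS distribution list.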

Server Room Fire Safety Hazards

Within a server room there are several potential fire sources including:

  • Electrical Equipment Failures: whilst most of the focus is on the IT servers and peripherals within a computer or server room, all electronic devices and power related distribution systems should be inspected and regularly tested. Electronic components age and batteries can suffer heat damage, leading to electrical short circuits and potential fires. Outlets and plug socket overloading can be common problems in IT spaces that have expanded with little if any planning.
  • Hot-spots and Cooling System Failures: within server racks, hot-spots can build up due to poor equipment layout, which can lead to a risk of fire if there is a sudden rise in room temperature. Poor rack layout and general obstacles to efficient air flow within the room can also create hot-spots.
  • Raised Access Floor and Plenum Areas: under floor areas can hide a multitude of potential problems including poorly laid cables, damaged floor tiles and areas of overheating that can lead to a fire risk. Water ingress may also be an issue if the floor is also used for liquid cooling pipe supply and returns. Suspended ceilings can also be an issue if the ceiling void is used for cable runs (power and/or comms) that cannot be easily inspected.
  • Poor Cleaning Regimes: IT equipment collects dust as air is generally drawn in through the front of the device and expelled through the rear ventilation fans. Dust particles in the air can build up from activity within the room and settle in hard-to-reach areas, leading to a heat build-up. Waste materials and packing left within a comms room or server room or even a data centre build-out and equipment service area can also lead to a fire risk. Waste bins should be emptied regularly, and packing removed to a safe area outside the room. If there is a spark in the room, from a short-circuit or build-up of static electricity, the debris and waste materials can help to fuel a fire.

An additional standard check may also be carried out to verify the information security management systems in place, which can include ISO27001 and Cyber Essentials.

Once complete, the server room availability risk assessment document should help to identify the most critical areas for urgent review (orange and red bands). Improvement actions could include developing projects for infrastructure improvements through policy and process reviews or upgrades to existing hardware and software. An action plan should document and prioritise the actions, their scheduled due dates and the people responsible, and be presented for approval by the appropriate bodies.
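
Picking out the orange and red entries for the action plan can be sketched as a filter and sort. This is a minimal illustration, assuming each hazard carries Probability and Severity scores of 1-5 and that a rating of 10 or more marks the orange and red bands, as in the rating table. The example hazards, their scores and the function name `review_list` are hypothetical.

```python
from typing import Dict, List


def review_list(hazards: List[Dict], threshold: int = 10) -> List[Dict]:
    """Return hazards rated at or above the threshold, highest rating first."""
    rated = [{**h, "rating": h["probability"] * h["severity"]} for h in hazards]
    flagged = [h for h in rated if h["rating"] >= threshold]
    return sorted(flagged, key=lambda h: h["rating"], reverse=True)


# Hypothetical entries and scores, for illustration only.
example_hazards = [
    {"hazard": "Single air conditioner with no standby unit",
     "probability": 3, "severity": 4},
    {"hazard": "Network switch not connected to the UPS",
     "probability": 4, "severity": 4},
    {"hazard": "Packaging left in the server room",
     "probability": 2, "severity": 3},
]

for entry in review_list(example_hazards):
    print(entry["rating"], entry["hazard"])
```

Here the unprotected network switch (rating 16) is listed first, followed by the single air conditioner (rating 12). The packaging hazard (rating 6) falls in the yellow band and is left out of the urgent list.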

Summary

Carrying out risk assessments within a server room should be a mandatory requirement for any organisation that cannot operate without its IT system. The risk assessment should be reviewed annually or if there is an incident or significant change in critical infrastructure systems or their usage. The risk assessment should feed into business continuity planning to ensure that an organisation can continue to operate should there be a major disruption that causes the business continuity plan to be activated.

Our projects team carry out risk assessments across the UK and work with organisations to develop robust business continuity plans. Please contact us if you would like to receive more information on our risk assessment services for server rooms and data centres or would like to receive a copy of our server room risk assessment template in Excel format.
