01/11/2020

How System Availability Requirements Drive Data Centre Design


Availability is a key measure in server room and data centre design. This key performance indicator is the percentage of time an IT facility is operational over the total period being measured, and is often expressed in ‘nines’.

What is Server Room or Data Centre Availability?

Availability is not only a measure of operational performance; it also reflects operation and maintenance efficiency and the quality of the server room or data centre design. In particular, availability considers the critical infrastructure systems and paths of delivery to the IT load. These critical data centre systems include uninterruptible power supplies (UPS), standby power generators and the cooling systems in place.

Whilst availability is often a key performance measure for hyperscale, large and medium-scale data centres, this is not always the case for smaller data centres and server rooms. Yet there are design and planning principles that smaller sites can adopt to improve their ability to ride through power outages, cooling problems and system downtime.

Availability Calculation and Nines Classification

Every design carries risk, and calculating availability for a server room or data centre design provides a way to manage it. Availability combines two operational and maintenance metrics often quoted on datasheets: the Mean Time Between Failures (MTBF) and the Mean Time To Repair (MTTR). The relationship as a formula is:

Availability (%) = (1 - (MTTR / MTBF)) x 100
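As a minimal sketch, the formula can be evaluated directly in Python. The function names and the MTBF/MTTR figures below are illustrative assumptions, not datasheet values. The second function shows the steady-state form MTBF / (MTBF + MTTR), which is also widely quoted; the two agree closely whenever MTTR is small compared with MTBF.

```python
def availability_percent(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability as a percentage, using the formula given in the text."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return (1 - mttr_hours / mtbf_hours) * 100


def availability_percent_steady_state(mtbf_hours: float, mttr_hours: float) -> float:
    """Alternative steady-state form: MTBF / (MTBF + MTTR), as a percentage."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100


if __name__ == "__main__":
    # Hypothetical figures: 100,000 hours MTBF and 8 hours MTTR.
    print(round(availability_percent(100_000, 8), 4))
    print(round(availability_percent_steady_state(100_000, 8), 4))
```

With these assumed figures both forms give roughly 99.992%, which falls in the four-nines band.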

For any data centre design or server room system, it is possible to calculate availability as a percentage and express this in terms of downtime per year, month, week and day.

| Availability | Level | Downtime per year | Downtime per month | Downtime per week | Downtime per day |
|---|---|---|---|---|---|
| 90% | one nine | 36.53 days | 73.05 hours | 16.80 hours | 2.40 hours |
| 99% | two nines | 3.65 days | 7.31 hours | 1.68 hours | 14.40 minutes |
| 99.9% | three nines | 8.77 hours | 43.83 minutes | 10.08 minutes | 1.44 minutes |
| 99.99% | four nines | 52.60 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds |
| 99.999% | five nines | 5.26 minutes | 26.30 seconds | 6.05 seconds | 864.00 milliseconds |
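The downtime figures in the table can be reproduced with a few lines of Python. This is a sketch that assumes a 365.25-day year and a month of one twelfth of a year, which is what the 36.53-day and 73.05-hour figures imply; the names used are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumed period lengths in days (365.25-day year, month = year / 12).
PERIOD_DAYS = {
    "year": 365.25,
    "month": 365.25 / 12,
    "week": 7.0,
    "day": 1.0,
}


def downtime_seconds(availability_pct: float, period_days: float) -> float:
    """Permitted downtime in seconds for a given availability and period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_pct / 100) * period_days * SECONDS_PER_DAY


def downtime_table(availability_pct: float) -> dict:
    """Downtime in seconds per year, month, week and day."""
    return {
        name: downtime_seconds(availability_pct, days)
        for name, days in PERIOD_DAYS.items()
    }


if __name__ == "__main__":
    for pct in (90, 99, 99.9, 99.99, 99.999):
        row = downtime_table(pct)
        print(pct, {name: round(seconds, 2) for name, seconds in row.items()})
```

The results are in seconds; dividing by 60, 3,600 or 86,400 gives the minutes, hours and days shown in the table.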

99.9% may sound like a very achievable availability, but it equates to a potential downtime of 1.44 minutes per day, or almost nine hours per year.

If you run a co-location or critical server room operation, this amount of risk may be too high and a way must be found to increase availability to 99.99% or higher. Risk can only be reduced and mitigated, never eliminated. Even at ‘nine nines’ (99.9999999%), the daily downtime risk for a data centre still equates to 86.4 microseconds.

Uptime Institute Tier Levels

The Uptime Institute is a key body in the data centre industry and will certify designs and operations to one of four Tier levels. Holding a specific certification gives the data centre a way to audit its design and operation, and gives customers a measure of its downtime avoidance strategy.

Tier I Design

A Tier I data centre is typically the design adopted by many server room operators. At this level, the critical infrastructure deployed will support the IT servers, networks and peripherals via a single path but is not designed to withstand unexpected infrastructure failures.

Tier I provides some protection from human error. There is little, if any, redundancy designed into the critical power path and critical cooling systems. If comprehensive preventative maintenance is required, the IT systems must be shut down. If the manufacturers’ maintenance regimes for the critical systems are not followed, there is an even greater risk of disruption and unplanned outages.

| Tier | Delivery Paths | Redundancy | Maintenance | Fault Tolerant |
|---|---|---|---|---|
| 1 | 1 | No | No | No |

The characteristics of a Tier 1 data centre include:

  • Uptime: 99.671% uptime per annum, the threshold level for a tier-grading by the Uptime Institute
  • Downtime: no more than 28.8 hours of downtime per annum
  • Zero Redundancy: the facility does not have redundant paths in its power or cooling systems

A Tier 1 data centre guarantees 99.671% availability, which equates to ‘two-nines’. If there is a need for comprehensive maintenance, the facility may require a complete shutdown. Individual system maintenance within the power and cooling paths could also disrupt IT operations.

Tier II Building

Tier II data centres have multiple redundant components in the critical power path and critical cooling systems and provide greater opportunities for maintenance. As with a Tier I facility, if there is an expected shutdown event, system availability suffers.

| Tier | Delivery Paths | Redundancy | Maintenance | Fault Tolerant |
|---|---|---|---|---|
| 1 | 1 | No | No | No |
| 2 | 1 | N+1 | No | No |

The characteristics of a Tier II data centre include:

  • Uptime: 99.741% uptime per annum
  • Downtime: no more than 22 hours of downtime per annum
  • Partial Redundancy: power and cooling paths have some redundant components, but the overall facility is not fault tolerant

A Tier 2 data centre has improved availability over a Tier 1 building due to the partial redundancy in its power and cooling paths. The data centre guarantees 99.741% availability, which still equates to ‘two-nines’.

From an Uptime Institute Tier-rating point of view, most server rooms could be Tier 1 or 2. Computer rooms are typically Tier 1.

Tier III Co-location Data Centres

The Tier III level is the typical standard for co-location data centres providing public and private cloud services. At Tier III, data centre systems are concurrently maintainable as the power and cooling paths have multiple redundant components and distribution paths to the IT system loads. There should be no need for system shutdown for maintenance or system replacement. Tier III builds on Tier II to prevent the need for IT system shutdown.

| Tier | Delivery Paths | Redundancy | Maintenance | Fault Tolerant |
|---|---|---|---|---|
| 1 | 1 | No | No | No |
| 2 | 1 | N+1 | No | No |
| 3 | 1 Active / 1 Passive | N+1 | Concurrent | No |

The characteristics of a Tier III data centre include:

  • Uptime: 99.982% uptime per annum
  • Downtime: no more than 1.6 hours of downtime per annum to allow for maintenance and emergency issues
  • N+1 Redundancy: allows for planned maintenance and emergency response to problems that could affect operations, although there could still be problems delivering customer-facing services
  • 72 hours of Power Outage Protection: the data centre has at least three days of exclusive power available in the form of UPS battery sets and standby power generators, not external sources

A Tier 3 data centre has an availability of 99.982% and guarantees ‘three-nines’.

Tier IV Data Centres

What separates a Tier IV from a Tier III data centre is the use of several independent and physically isolated systems to provide redundant capacity and distribution paths. The separation provides protection from planned and unplanned events and adds fault tolerance to a Tier III design. The critical power and cooling systems must be fault-tolerant and continuous. An example of system separation would be two incomers to the building from two sub-stations, each with its own separate connection to the local national grid, where ‘local’ can mean a separation of many miles (e.g. 100 miles).

| Tier | Delivery Paths | Redundancy | Maintenance | Fault Tolerant |
|---|---|---|---|---|
| 1 | 1 | No | No | No |
| 2 | 1 | N+1 | No | No |
| 3 | 1 Active / 1 Passive | N+1 | Concurrent | No |
| 4 | Multiple | 2N | Concurrent | Yes |

The characteristics of a Tier IV data centre design include:

  • Uptime: 99.995% uptime per annum
  • Downtime: no more than 26.3 minutes of downtime per annum with maintenance and emergency issues having zero effect on customer service delivery
  • 2N+1 Fully Redundant Systems: providing two times each system within the power and cooling paths including external grid connections
  • 96 hours of Power Outage Protection: the data centre has at least four days of exclusive power available in the form of UPS battery sets and standby power generators, not external sources
  • Zero Points of Failure: every process and data stream has redundancy and no single outage or error can interrupt service delivery or shut down the facility

A Tier 4 data centre has an availability of 99.995% and guarantees ‘four-nines’.
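The annual downtime quoted for each Tier follows directly from its uptime percentage. The sketch below assumes a 365-day (8,760-hour) year, which reproduces the 28.8-hour, 1.6-hour and 26.3-minute figures; the dictionary and function names are illustrative, and `minimum_tier` is a hypothetical helper for matching a target availability to a Tier.

```python
from typing import Optional

HOURS_PER_YEAR = 8760  # assumption: 365-day year

# Uptime percentages as quoted in the Tier descriptions above.
TIER_UPTIME_PCT = {"I": 99.671, "II": 99.741, "III": 99.982, "IV": 99.995}


def annual_downtime_hours(uptime_pct: float) -> float:
    """Hours of downtime per year implied by an uptime percentage."""
    return (1 - uptime_pct / 100) * HOURS_PER_YEAR


def minimum_tier(required_uptime_pct: float) -> Optional[str]:
    """Lowest Tier whose quoted uptime meets the requirement, or None."""
    for tier, uptime in TIER_UPTIME_PCT.items():  # ordered from I to IV
        if uptime >= required_uptime_pct:
            return tier
    return None


if __name__ == "__main__":
    for tier, uptime in TIER_UPTIME_PCT.items():
        print(f"Tier {tier}: {annual_downtime_hours(uptime):.2f} hours per year")
    print(minimum_tier(99.9))
```

On this basis the Tier II figure works out at about 22.7 hours, slightly above the rounded 22 hours quoted above.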

Tier Level Planning and Service Level Agreements

The difference in cost to design and build a Tier IV data centre compared to a Tier III data centre can be substantial. Where a Tier III data centre (99.982% availability) needs to support 99.99% availability (or higher) for a specific service level agreement, options such as virtualised servers, storage, disaster recovery solutions and multiple load-sharing data centre locations may be available. Tier IV data centres may be able to offer services with 99.999% (five-nines) availability through similar specific configurations.
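One way to see why multiple load-sharing locations can lift service availability above that of a single facility is the standard parallel-availability calculation. The sketch below rests on strong assumptions that are not stated in the text: site failures are independent, any one site can carry the full load, and failover is instantaneous. Real deployments will achieve less, so treat the result as an upper bound.

```python
from typing import Iterable


def parallel_availability_pct(site_availabilities_pct: Iterable[float]) -> float:
    """Availability of a service that stays up while at least one site is up.

    Assumes independent site failures and instant, lossless failover.
    """
    probability_all_down = 1.0
    for pct in site_availabilities_pct:
        if not 0 <= pct <= 100:
            raise ValueError("availability must be between 0 and 100")
        probability_all_down *= 1 - pct / 100
    return (1 - probability_all_down) * 100


if __name__ == "__main__":
    # Two sites at the Tier III figure of 99.982%.
    print(f"{parallel_availability_pct([99.982, 99.982]):.7f}")
```

Under these idealised assumptions, two 99.982% sites comfortably exceed a 99.99% target; shared dependencies such as network links or software faults would reduce the real figure.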

More information: https://uptimeinstitute.com/tiers

Server Room Availability and Customer Facing Services

A Tier 3 or Tier 4 approach will be beyond the budget of most smaller IT operations, computer rooms and server rooms.

For these sites, the focus should be on a Tier 1 or Tier 2 configuration for their critical infrastructure systems dependent upon the services they provide and acceptable levels of downtime. Steps can be taken to provide these with individual system redundancy and maintainability, but they still only provide one path to the IT load.
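With a single path to the IT load, every system in that path must be working for the load to stay up, so the path availability is the product of the individual availabilities. The sketch below illustrates this series calculation; the component figures are hypothetical and failures are assumed to be independent.

```python
from typing import Iterable


def series_availability_pct(component_availabilities_pct: Iterable[float]) -> float:
    """Availability of a single path in which every component must be up."""
    path = 1.0
    for pct in component_availabilities_pct:
        if not 0 <= pct <= 100:
            raise ValueError("availability must be between 0 and 100")
        path *= pct / 100
    return path * 100


if __name__ == "__main__":
    # Hypothetical figures for mains supply, UPS and cooling on one path.
    single_path = [99.9, 99.99, 99.95]
    print(f"{series_availability_pct(single_path):.3f}")
```

The result is always lower than the weakest component, which is why adding redundancy to individual systems helps but does not remove the limitation of a single delivery path.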

Power

A centralised uninterruptible power supply should be installed with an external maintenance bypass to allow the UPS system to be maintained without disruption to the load. Of course, if there is a mains power supply failure during maintenance, there is a risk of system downtime. This can be mitigated by scheduling preventative maintenance visits outside working hours, but that may not be possible if the UPS needs to be serviced during working hours due to an emergency failure. If the budget allows, a modular UPS could be installed (instead of a monoblock system) to provide N+1 redundancy and scalability, improving availability during maintenance through the use of ‘hot-swap’ modules.
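As a rough sketch of N+1 sizing for a modular UPS: N is the number of modules needed to carry the load, and one further module is added so that the load is still supported if a module fails or is hot-swapped. The load and module ratings below are hypothetical, and a real sizing exercise would also consider power factor, battery runtime and future growth.

```python
import math


def modules_required(load_kw: float, module_kw: float, redundant_modules: int = 1) -> int:
    """Modules needed to carry the load (N) plus the redundant modules (+1)."""
    if load_kw <= 0 or module_kw <= 0:
        raise ValueError("load and module rating must be positive")
    if redundant_modules < 0:
        raise ValueError("redundant module count must not be negative")
    n = math.ceil(load_kw / module_kw)
    return n + redundant_modules


def survives_module_loss(load_kw: float, module_kw: float, installed_modules: int) -> bool:
    """True if the remaining modules can carry the load with one module out."""
    return (installed_modules - 1) * module_kw >= load_kw


if __name__ == "__main__":
    # Hypothetical example: 40 kW load served by 10 kW modules.
    installed = modules_required(40, 10)
    print(installed, survives_module_loss(40, 10, installed))
```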

Cooling

Two load-sharing wall-mounted air conditioners should be installed. Under normal operation the air conditioners share the load, or they can be set to cycle mode where each operates one week on and one week off. The arrangement is N+1, so that if one AC unit fails, the other can pick up the total load. The same applies if one AC unit is taken out of service for maintenance.
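The week-on, week-off cycle with failover can be sketched as a simple selection rule. This is an illustration only, not how any particular air conditioner controller works; the unit names and the health flags are hypothetical inputs.

```python
from datetime import date
from typing import Dict, Optional

UNITS = ("AC-1", "AC-2")  # hypothetical unit names


def duty_unit(day: date, healthy: Dict[str, bool]) -> Optional[str]:
    """Pick the unit that should run on a given day.

    Units alternate every seven days; if the scheduled unit is unhealthy
    or out for maintenance, the other unit takes the full load.
    """
    week_index = day.toordinal() // 7  # continuous seven-day blocks
    scheduled = UNITS[week_index % 2]
    standby = UNITS[(week_index + 1) % 2]
    if healthy.get(scheduled, False):
        return scheduled
    if healthy.get(standby, False):
        return standby
    return None  # both units unavailable: the room has no cooling


if __name__ == "__main__":
    status = {"AC-1": True, "AC-2": True}
    print(duty_unit(date(2020, 11, 1), status))
```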

Monitoring

Other critical infrastructure systems to consider include room and rack-level environment monitoring. The two most monitored environmental factors are temperature and humidity. Sudden changes in these can indicate cooling system failure. Rack-level monitoring can help to identify hot-spots which can present a fire risk and reduce hardware reliability.
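A basic monitoring check might compare each reading against limits and flag sudden temperature changes between consecutive readings. The sketch below is illustrative: the limits and the step threshold are assumed values chosen for the example, and should be replaced with figures appropriate to the equipment and site.

```python
from typing import List, Optional

# Assumed, illustrative limits; set these to suit the site and equipment.
TEMP_RANGE_C = (18.0, 27.0)
HUMIDITY_RANGE_PCT = (20.0, 80.0)
MAX_TEMP_STEP_C = 3.0  # largest acceptable change between two readings


def check_reading(
    temp_c: float,
    humidity_pct: float,
    previous_temp_c: Optional[float] = None,
) -> List[str]:
    """Return a list of alert messages for one room or rack sensor reading."""
    alerts = []
    if not TEMP_RANGE_C[0] <= temp_c <= TEMP_RANGE_C[1]:
        alerts.append("temperature out of range")
    if not HUMIDITY_RANGE_PCT[0] <= humidity_pct <= HUMIDITY_RANGE_PCT[1]:
        alerts.append("humidity out of range")
    if previous_temp_c is not None and abs(temp_c - previous_temp_c) > MAX_TEMP_STEP_C:
        alerts.append("sudden temperature change")
    return alerts


if __name__ == "__main__":
    print(check_reading(22.0, 45.0))
    print(check_reading(30.0, 45.0, previous_temp_c=22.0))
```

Running the same check per rack, rather than only per room, is what allows a hot-spot to be located.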

Fire Suppression

One of the key critical systems often overlooked is fire suppression. Electrical equipment power demands can easily overload sockets and connections in poorly designed and managed facilities. Electronic systems can easily overheat, especially if there is a hot-spot within a rack or a cooling system failure. Cabling under raised access floors can be damaged, overheat and suffer short-circuits, often with little sign until smoke and fire are evident. With most facilities running 24/7, there are long periods without staff on-site and this increases the potential for a catastrophic downtime event. Complete room fire suppression systems may not be feasible for small organisations, but fire suppression can be put in place for racks using rack mount systems or rack cabinet fire suppression systems.

Designing A Data Centre or Server Room

Whether designing a new data centre or upgrading a server room, it is important to take an overview and consider availability and risk mitigation. Whilst smaller computer and server rooms may not have the budget of a Tier III or Tier IV data centre, they can adopt some of the design and operation principles of Tier I and Tier II facilities. For these sites, steps can be taken to reduce risk and build redundancy into critical infrastructure systems, including uninterruptible power and air conditioning systems.

Additional low-cost monitoring systems can also be deployed to watch for factors that indicate design issues, system failures and areas for concern. The overall result will be a resilient IT facility that, with regular preventative maintenance and suitably timed hardware refreshes, can provide the levels of availability required to support the service level agreements provided to the organisation’s own customers.

Please contact our projects team for more information on our data centre design services.
