With AI business applications on the rise, organisations in every sector are getting ready to invest in infrastructure to support the productivity advantages they represent.
But these AI applications are, to say the least, resource-intensive. The high-density computing they require involves power supply and cooling systems that can adapt to specific consumption and cooling demand profiles. A new generation of dedicated power and cooling systems needs to be designed, operated and, crucially, maintained with renewed focus.
Product manufacturers, design consultants and specialist contractors cover the delivery of the infrastructure that supports the latest GPU (graphics processing unit) designs driving these applications. However, the operational and maintenance requirements should also be considered at the planning stage.
High-density computing applications have two main characteristics in terms of power consumption. First, although advances in server hardware have delivered greater energy efficiency per unit of data processed, AI applications require more energy for data-intensive processes, such as model training and real-time inference. Second, the dynamic consumption profile varies depending on factors such as the type of tasks being executed, system configuration and GPU architecture.
These extra energy needs challenge thermal solutions, requiring both air and liquid cooling to manage the heat these new servers produce. Enhancing air-cooled systems with rear-door or direct-to-chip liquid cooling demands that operators manage dual thermal solutions and cope with reduced system thermal inertia while still meeting SLA (service-level agreement) temperatures. The deployment of fluid-based cooling necessitates leak detection, regular fluid sampling (potentially involving hazardous chemicals) and “smart hands” server swap capabilities.
Data centre operations managers traditionally depend on events and alarms from various management systems such as BMS (building management systems) and EPMS (electrical power management systems). This operational data supports daily activities but lacks the real-time, product-specific information needed for detailed analytics. As a result, energy inefficiencies may be missed, potentially increasing costs and hindering company sustainability goals. Crucial real-time data, such as fluid flow rates and pressures, is limited, and advanced predictive analytics for equipment maintenance are absent.
Traditional power and thermal infrastructure maintenance methods are time-based and involve routine checks, regardless of the actual condition of the equipment. This approach introduces risk in manual switching and may lead to unforeseen downtime, inefficient operations and unidentified risks. In high-density compute applications where operation depends heavily on infrastructure performance – including components such as coolant, valves, pumps, filters, batteries, fans and capacitors – the use of enhanced data analytics is essential to provide an effective solution.
Leading critical digital infrastructure equipment manufacturers such as Vertiv provide holistic services that meet the requirements of a new generation of power and cooling systems to support AI applications. The services offer hands-on experience during commissioning and operation, along with real-time data collection from critical assets. Some new services are able to securely transmit equipment health data to a cloud platform, where AI and machine learning analyse it to produce health scores and detect anomalies.
With access to detailed analytics and reporting tools, operators can make informed strategic decisions. The enhanced visibility of operations increases confidence among stakeholders and supports operational efficiency.
Typical analytics include the following:
Site health score. This identifies both healthy and potentially problematic locations, and benchmarks equipment installed in similar electrical and physical environments to inform maintenance tasks and frequency.
Equipment health score. Various subsystem and component health scores are tabulated to enable accurate maintenance regimes and life-cycle replacement strategies.
Critical alarms. Alarms needing immediate attention by operational teams are highlighted.
Number of critical alarms. Sites with frequent and/or specific types of critical events are highlighted for better preparation and prevention (for example, stock parts).
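As a rough illustration of how such scores might be aggregated, the sketch below computes an equipment health score as a weighted average of component scores and a site score as the mean of equipment scores. The component names, weights and the averaging scheme are illustrative assumptions for this article, not the actual scoring model of any product.

```python
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    health: float   # 0.0 (failed) to 100.0 (fully healthy)
    weight: float   # relative importance within the equipment score

def equipment_health(components):
    """Weighted average of component health scores."""
    total_weight = sum(c.weight for c in components)
    return sum(c.health * c.weight for c in components) / total_weight

def site_health(equipment_scores):
    """Simple mean of equipment scores; a weakest-link minimum
    would be a stricter alternative."""
    return sum(equipment_scores) / len(equipment_scores)

# Hypothetical UPS and cooling-unit component scores
ups = [Component("battery", 72.0, 3.0),
       Component("fan", 95.0, 1.0),
       Component("capacitor", 88.0, 2.0)]
cooling_unit = [Component("pump", 90.0, 2.0),
                Component("filter", 60.0, 1.0)]

scores = [equipment_health(ups), equipment_health(cooling_unit)]
print(round(site_health(scores), 1))
```

A low component score (here the filter at 60) drags down its equipment score, which is what would prompt a condition-based maintenance task rather than a calendar-based one.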
To reduce financial, technical and safety risks amid the increasing complexity of modern data centres (especially under the demands of AI and high-performance computing), scalable and adaptable maintenance solutions are required, including the following:
A maintenance approach that recognises potential issues before they occur by continuously measuring the health of critical infrastructure systems, identifying anomalies, alerting to health events and enabling lifecycle maintenance aligned with actual equipment needs.
Advanced incident management support offering troubleshooting, root cause analysis and incident response by linking critical systems with expert engineers. Real-time data is monitored to spot trends, predict behaviours and address anomalies. Issues can be resolved remotely or by dispatching a field engineer. This connected service optimises equipment performance and maximises availability.
Customer portals offer an intuitive, cloud-based interface for easy access to data centre asset information and graphical representations of the rapid or gradual declines in equipment health scores. With comprehensive dashboards, users can quickly make informed decisions, improving efficiency and reducing downtime risks.
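The anomaly detection underpinning this kind of monitoring can be sketched in miniature: the example below flags readings in a coolant flow-rate stream that deviate sharply from a rolling window of recent values. The window size, threshold and sample data are illustrative assumptions, not parameters of any specific monitoring service.

```python
from collections import deque
from statistics import mean, stdev

def detect_anomalies(readings, window=5, threshold=3.0):
    """Return indices of readings whose z-score against the
    preceding rolling window exceeds the threshold."""
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(readings):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        history.append(value)
    return anomalies

# Hypothetical flow-rate stream (litres/min) with a sudden drop,
# such as a failing pump or a developing leak might produce
flow = [10.1, 10.0, 10.2, 9.9, 10.1, 10.0, 4.2, 10.1]
print(detect_anomalies(flow))
```

In a real service, flagged indices would feed the incident-management workflow described above, either triggering remote diagnosis or the dispatch of a field engineer.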
by Mike O’Keeffe, Senior Vice President, Vertiv