Vendors

Huawei targets AI data centre reliability with Xinghe AI Fabric 2.0

As AI workloads reshape enterprise infrastructure, data centre networks are under increasing pressure to deliver higher performance, resilience and automation.

At Mobile World Congress Barcelona, Huawei expanded its Xinghe AI Fabric portfolio, positioning the technology as a foundation for AI-driven data centres. According to Arthur Wang, President of the Data Center Network Domain at Huawei’s Data Communication Product Line, the latest release – Xinghe AI Fabric 2.0 – reflects a shift in how networks must operate in the AI era.

“Data centre networks have moved from cloud and virtualisation to a new stage driven by AI,” Wang told Developing Telecoms. “Enterprises now want to fully unleash and efficiently use their computing power.”

A network architecture built for the AI era

Huawei’s AI Fabric architecture is structured around three layers: AI Brain, AI Connectivity and AI Network Elements.

Wang likens the design to the human body. The AI Brain layer acts as the decision-making centre, analysing network data and directing operations. Beneath it, AI Connectivity functions like a circulatory system, transferring data across the network. At the base, hardware-based AI Network Elements provide a highly reliable and intrinsically secure foundation.

Huawei first introduced an AI Fabric concept in 2018, initially focused on reducing packet loss within networks. However, the rapid expansion of AI workloads since 2023 has forced vendors to rethink data centre networking entirely.

The new Xinghe AI Fabric 2.0 integrates four major components: Rock-Solid Architecture 2.0, StarryWing Digital Map 2.0, Xinghuan AI Turbo 2.0, and iFlashboot 2.0. Together, Huawei says these capabilities allow enterprises to build always-on AI data centre networks capable of maximising computing performance.

Tackling hidden network faults

One major challenge in large-scale data centres is identifying faults that are difficult to detect through traditional monitoring.

Wang describes these as “unknown faults” – issues that appear invisible during routine checks but still disrupt services. In some cases, network tests may show that connections are functioning normally, while applications remain unavailable.

“You can run a ping and everything looks fine,” Wang explained. “But the service has already been interrupted.”

These faults can be caused by factors such as hardware issues, abnormal routing entries or unexpected congestion. The difficulty lies in the scale of modern data centres. Large networks can contain hundreds of thousands of simultaneous traffic flows, yet most monitoring systems only analyse a small fraction due to hardware limitations.

Huawei claims its AI Eagle-Eye Engine, introduced as part of Rock-Solid Architecture 2.0, addresses this problem by monitoring up to 200,000 service flows in real time. AI-based analysis then allows the root cause of problems to be identified within minutes.

According to Huawei, this significantly reduces troubleshooting times that might otherwise take several hours.
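The general idea of catching “unknown faults” from flow-level telemetry can be illustrated with a toy sketch. The field names, thresholds and statistics below are hypothetical illustrations, not Huawei’s implementation: flows whose latency is a statistical outlier, or that show retransmissions, get flagged even though a basic reachability check such as ping would still succeed.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

@dataclass
class FlowSample:
    flow_id: str        # e.g. "10.0.0.1->10.0.0.2:443" (illustrative)
    latency_ms: float   # observed latency for this flow
    retransmits: int    # retransmissions seen in the sample window

def find_suspect_flows(samples, z_threshold=3.0):
    """Flag flows whose latency is a statistical outlier, or that show
    retransmissions, even though basic reachability (ping) succeeds."""
    latencies = [s.latency_ms for s in samples]
    mu, sigma = mean(latencies), pstdev(latencies)
    suspects = []
    for s in samples:
        z = (s.latency_ms - mu) / sigma if sigma else 0.0
        if z > z_threshold or s.retransmits > 0:
            suspects.append(s.flow_id)
    return suspects

# 100 healthy flows plus one silently degraded flow: every flow still
# "answers", but one is effectively unusable for the application.
flows = [FlowSample(f"flow{i}", 1.0, 0) for i in range(100)]
flows.append(FlowSample("bad", 250.0, 12))
print(find_suspect_flows(flows))  # → ['bad']
```

The point of the sketch is the scale problem the article describes: this kind of per-flow analysis only works if the hardware can actually export statistics for every flow, which is exactly what most monitoring systems cannot do today.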

Supporting multi-vendor networks

At the same time, enterprises are increasingly moving away from relying on a single networking vendor.

Analysts suggest that around 85% of organisations will adopt dual-vendor strategies within the next two to three years, often to reduce costs and supply chain risk. But managing multiple vendors introduces operational complexity, particularly when faults occur.

Huawei’s answer is a unified management layer based on StarryWing Digital Map 2.0 and the iMaster NCE controller platform. The system provides a standardised model that allows operators to manage heterogeneous network environments through a single interface.

The platform integrates with common IT service tools and can communicate with equipment from other major vendors in the industry.

Huawei says the same framework can extend beyond networking hardware to include third-party security infrastructure, allowing policies and configurations to be managed from a central platform.

AI tools are also used to analyse complex firewall rule sets. In large enterprise networks, thousands of policies can accumulate over time, making changes risky. Huawei’s system can analyse existing rules and warn administrators if new policies might conflict with existing ones.
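A minimal sketch of the kind of conflict check described – purely illustrative, using a made-up rule format rather than Huawei’s actual policy model – flags existing rules that match overlapping traffic but apply a different action:

```python
import ipaddress

def overlaps(a, b):
    # True if the two CIDR prefixes cover any common addresses.
    return ipaddress.ip_network(a).overlaps(ipaddress.ip_network(b))

def find_conflicts(new_rule, existing_rules):
    """Return existing rules that match overlapping traffic but apply a
    different action -- the clashes that make policy changes risky."""
    return [r for r in existing_rules
            if overlaps(new_rule["src"], r["src"])
            and overlaps(new_rule["dst"], r["dst"])
            and new_rule["action"] != r["action"]]

existing = [
    {"src": "10.0.0.0/8",    "dst": "192.168.1.0/24", "action": "permit"},
    {"src": "172.16.0.0/12", "dst": "192.168.2.0/24", "action": "deny"},
]
# A proposed deny rule whose traffic is a subset of an existing permit.
new = {"src": "10.1.0.0/16", "dst": "192.168.1.0/24", "action": "deny"}
print(find_conflicts(new, existing))  # flags the first rule
```

Real policy analysis also has to account for ports, protocols and rule ordering, which is why doing it by hand across thousands of accumulated rules is so error-prone.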

Maximising GPU performance

AI workloads are also changing the economics of networking.

GPU-based servers used for AI training and inference are significantly more expensive than traditional CPU-based infrastructure. Any network inefficiency can therefore result in costly computing resources sitting idle.

“The GPU servers are extremely expensive assets,” said Wang. “If the network becomes congested, GPUs have to wait for data. That means the computing power is wasted.”

To address this, Huawei introduced new optimisation technologies as part of Xinghuan AI Turbo 2.0, including Network Stream Load Balancing (NSLB) and Network Packet Load Balancing (NPLB).

Traditional load-balancing methods in AI clusters often achieve only around 50% bandwidth utilisation. Huawei claims its new algorithms can increase effective throughput to 90–98%, helping to maximise GPU efficiency in both training and inference scenarios.
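The gap between those figures comes down to how traffic is mapped onto parallel links. A toy simulation – illustrative only, and not Huawei’s NSLB/NPLB algorithms – contrasts classic per-flow ECMP hashing with packet-level spraying:

```python
import hashlib

def ecmp_link(flow_id, n_links):
    """Per-flow ECMP: hash the flow identifier to pin the whole flow
    to a single link."""
    return int(hashlib.md5(flow_id.encode()).hexdigest(), 16) % n_links

def utilisation(link_loads):
    """Effective utilisation: average load relative to the busiest link."""
    return sum(link_loads) / (len(link_loads) * max(link_loads))

n_links = 4
flows = ["gpu0->gpu4", "gpu1->gpu5", "gpu2->gpu6", "gpu3->gpu7"]

# Per-flow hashing: hash collisions can leave some links idle while
# others carry two of the large "elephant" flows typical of AI training.
per_flow = [0.0] * n_links
for f in flows:
    per_flow[ecmp_link(f, n_links)] += 1.0

# Packet-level balancing sprays every flow across all links evenly.
per_packet = [len(flows) / n_links] * n_links

print(utilisation(per_flow), utilisation(per_packet))
```

With four equal elephant flows on four links, a single hash collision puts two flows on one link and drops effective utilisation to 50% – roughly the figure cited for traditional schemes – while packet spraying keeps the links evenly filled, at the cost of having to handle out-of-order packet delivery.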

The technologies support GPUs from multiple vendors, including Huawei’s Ascend processors.

Liquid cooling for next-generation switches

Power consumption is another growing concern in AI infrastructure. As chip densities increase, the power draw of a single rack can exceed 30kW, making traditional air cooling increasingly ineffective.

To address this, Huawei launched the XH9230-LC liquid-cooled switch, designed specifically for AI data centre environments.

While liquid cooling for switch chips is already possible, cooling optical modules has proven more difficult because they must remain removable while maintaining efficient heat transfer.

Huawei says it solved this challenge using a flexible heat-dissipation structure built with thermally conductive interface materials (TIM). This allows optical modules to maintain close contact with cooling surfaces while still being easily plugged in and removed.

The company claims the design improves heat dissipation significantly, allowing up to eight switches to be deployed in a single rack while reducing air-conditioning requirements by as much as 60%.

Towards autonomous network operations

Looking ahead, Huawei believes automation will become increasingly important in data centre operations.

The company is developing an AI-driven operations platform called NetMaster, designed to act as an AI agent capable of automatically detecting and resolving network issues.

Working alongside the StarryWing Digital Map, NetMaster can identify faults, isolate problematic interfaces and reroute traffic without human intervention.

According to Wang, Huawei can already automate roughly 80% of network issues, though the remaining cases still require human oversight.

Fully autonomous networking will require extremely high reliability, he noted. But Wang believes the industry is not far from that goal.

“Maybe in two or three years,” he said. “But the system must be completely accurate before networks can fully trust it.”


