Supermicro NVIDIA HGX B200 Systems Demonstrate AI Performance Leadership on MLPerf Inference v5.0 Results

Super Micro Computer introduced new systems based on NVIDIA’s HGX B200 8-GPU platform, achieving leading performance across several MLPerf Inference v5.0 benchmarks. Both the 4U liquid-cooled and 10U air-cooled configurations demonstrated performance improvements, delivering more than three times (3X) the tokens per second on the Llama2-70B and Llama3.1-405B benchmarks compared with previous-generation NVIDIA H200 8-GPU systems.

Supermicro’s SYS-421GE-NBRT-LCC (liquid-cooled) and SYS-A21GE-NBRT (air-cooled), each equipped with eight NVIDIA B200-SXM-180GB GPUs, showed benchmark performance leadership. In the Mixtral 8x7B (Mixture of Experts) benchmarks, the systems achieved 129,047 tokens/second in the Server scenario and 128,795 tokens/second in the Offline scenario. Performance with the Llama3.1-405B model exceeded 1,000 tokens/second for an 8-GPU node, significantly higher than previous GPU generations achieved. For smaller inferencing tasks such as Llama2-70B, the B200-SXM-180GB-equipped system achieved the top performance among Tier-1 suppliers.

Notable benchmark results include:

– Stable Diffusion XL (Server): 28.92 queries/second
– Llama2-70B Interactive (Server): 62,265.70 tokens/second
– Llama3.1-405B (Offline, liquid-cooled): 1,521.74 tokens/second
– Llama3.1-405B (Server, air-cooled): 1,080.31 tokens/second for an 8-GPU node
– Mixtral 8x7B (Server, liquid-cooled): 129,047.00 tokens/second
– Mixtral 8x7B (Offline, liquid-cooled): 128,795.00 tokens/second

David Kanter, Head of MLPerf at MLCommons, said, “MLCommons congratulates Supermicro on their submission to the MLPerf Inference v5.0 benchmark. We are pleased to see their results showcasing significant performance gains compared to earlier generations of systems. Customers will be pleased by the performance improvements achieved which are validated by the neutral, representative and reproducible MLPerf results.”

The HGX B200 8-GPU systems use new cooling technologies. The liquid-cooled variant incorporates new cold plates and a 250 kW coolant distribution unit (CDU), offering more than double the cooling capacity of earlier generations in a 4U chassis. Available rack configurations of 42U, 48U, or 52U feature vertical coolant distribution manifolds (CDMs) to save rack space, enabling up to eight systems (64 NVIDIA Blackwell GPUs) per 42U rack, or up to twelve systems (96 GPUs) per 52U rack. The new air-cooled 10U chassis has been redesigned with greater thermal capacity to support eight 1,000W TDP Blackwell GPUs. Supermicro says this air-cooled option maintains the same density (up to four 10U systems per rack) as previous-generation products while delivering up to 15 times greater inference performance and 3 times greater training performance.
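
Those per-rack GPU counts follow directly from the chassis height and the eight GPUs per system quoted above. The short Python sketch below is a hypothetical sizing helper, not a Supermicro tool; it simply reproduces the arithmetic for the 42U and 52U liquid-cooled configurations.

```python
# Hypothetical rack-density sketch using only the figures quoted above;
# not a Supermicro configurator.
GPUS_PER_SYSTEM = 8   # NVIDIA HGX B200 8-GPU node
CHASSIS_U = 4         # 4U liquid-cooled chassis

def rack_summary(systems_per_rack: int, rack_units: int) -> str:
    """Summarize GPU count and compute space used for a given rack height."""
    gpus = systems_per_rack * GPUS_PER_SYSTEM
    compute_u = systems_per_rack * CHASSIS_U
    return (f"{rack_units}U rack: {systems_per_rack} x {CHASSIS_U}U systems, "
            f"{gpus} GPUs, {compute_u}U of compute space")

print(rack_summary(8, 42))   # 8 x 4U systems, 64 GPUs, 32U of compute in a 42U rack
print(rack_summary(12, 52))  # 12 x 4U systems, 96 GPUs, 48U of compute in a 52U rack
```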
