
[@ServeTheHomeVideo] Lambda AI Cluster Tour with Supermicro NVIDIA HGX B200 ft. @ServeTheHomeVideo

· 6 min read

Link: https://youtu.be/KGRk03PO8KI

Short Summary

  • Number One Action Item/Takeaway: Leverage Lambda's "one-click" AI cluster solution to easily and quickly access powerful computing resources (ranging from 16 to thousands of B200 GPUs) for AI development and production, bypassing the complexities and costs of building and maintaining your own infrastructure.

  • Executive Summary: The video showcases Lambda's state-of-the-art AI cluster featuring thousands of NVIDIA B200 GPUs housed in Super Micro servers, highlighting its powerful infrastructure and the "one-click" access solution for scalable AI computing. Lambda's cluster simplifies access to significant GPU resources for AI teams.

Key Quotes

Four direct quotes from the video stand out as particularly insightful:

  1. "Lambda's Columbus data center represents our second expansion in the Midwest. Ultimately, we're going to be growing into the hundreds of megawatts into 2026 and starting to deploy liquid friendly chip technology as well." (Highlights the scale and future plans of Lambda's infrastructure.)
  2. "The NVME SSDs contained in these arrays are physically smaller than today's hard drives, but they're also much higher performance and have higher capacities than those larger hard drives. This is important since the job of this network storage is to deliver pabytes of data to the GPUs as fast as possible." (Explains the crucial importance of fast storage in AI clusters.)
  3. "Each GPU in this server gets a nick to handle its east west network traffic." (Explains the purpose of the vast ammount of ethernet ports)
  4. "Unlike a standard server where you have power supplies that are built into the server, you actually have the power supplies that sit in power supply shelves on the top and bottom of racks...you literally need liquid cooling to be able to get that density and make this entire rack operate as a single GPU." (Explains why systems like the Nvidia GB200 NVL72 require integrated solutions like liquid cooling)

Detailed Summary

A detailed, bullet-point summary of the video (sponsor segments excluded):

Overview of Lambda's AI Cluster in Columbus, Ohio:

  • The video tours a Lambda AI cluster in Columbus, Ohio, hosted in a Cologix data center.
  • It's one of the first NVIDIA B200 GPU clusters available.
  • Users can start a cluster with "one click", ranging from 16 to over 1,500 GPUs.
  • The cluster includes a vast storage array, reaching tens of petabytes.
  • Lambda offers both on-demand instances/clusters and private cloud solutions for larger deployments (50k-100k GPUs).
  • The data center is Lambda's second expansion in the Midwest, with plans to grow to hundreds of megawatts through 2026 and to deploy liquid-friendly chip technology.
  • The current installation has several thousand B200 GPUs and is growing.

Cooling Infrastructure:

  • The data center is air-cooled; liquid cooling is expected to become standard for AI clusters within the next 18 months. The Super Micro HGX B200 servers still perform well in air-cooled environments, allowing faster deployment with no plumbing required.
  • Large heat exchangers (similar to car radiators) are on each side of the cluster to efficiently extract heat.
  • These exchangers work with chillers on the roof (not shown in the video).
  • The facility is a 36-megawatt installation with its own substation (a rough power budget is sketched after this list).
  • It includes Rolls-Royce generators and battery backup/UPS systems.
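
To put the 36-megawatt figure in perspective, here is a back-of-the-envelope power budget. The per-server draw and facility overhead below are assumptions for illustration, not figures from the video:

```python
# Back-of-the-envelope: how many servers a 36 MW facility can host.
# Assumptions (not from the video): ~10 kW draw per 8-GPU HGX B200
# server under load, and ~40% of facility power going to cooling,
# power conversion, storage, and networking.
FACILITY_MW = 36
SERVER_KW = 10      # assumed per-server draw
OVERHEAD = 0.40     # assumed non-GPU-server share of facility power

usable_kw = FACILITY_MW * 1000 * (1 - OVERHEAD)
servers = usable_kw / SERVER_KW
print(f"~{servers:.0f} servers, ~{servers * 8:.0f} B200 GPUs")
# ~2160 servers, ~17280 GPUs -- the same order of magnitude as the
# "several thousand B200 GPUs and growing" mentioned in the tour.
```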

Data Center Design:

  • Utilizes a cold aisle/hot aisle design.
  • Air is pulled through the GPU servers, and hot air is contained in the hot aisle.
  • Hot air rises, passes through the heat exchangers, and is cooled for recirculation.

Storage Infrastructure:

  • All-flash array powered by VAST Data.
  • VAST focuses on software, and the NVMe storage nodes are Super Micro 1U servers.
  • The array provides fast and dense storage.
  • NVMe SSDs are smaller but offer higher performance and capacities than traditional hard drives.
  • The array's purpose is to deliver petabytes of data to the GPUs quickly so they never sit idle (a rough throughput calculation follows this list).
  • High density allows for tens of petabytes of storage in a smaller footprint than hard drive arrays.
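
As a rough illustration of why flash throughput matters here, the sketch below estimates how long a full pass over a training dataset takes at two assumed read rates. The dataset size and both throughput figures are assumptions, not from the video:

```python
# Illustrative only: time to stream a training dataset to the GPUs
# at two assumed aggregate read throughputs.
DATASET_PB = 2
THROUGHPUTS_GB_PER_S = {"all-flash NVMe": 1000, "hard-drive array": 50}

for label, gb_per_s in THROUGHPUTS_GB_PER_S.items():
    hours = DATASET_PB * 1_000_000 / gb_per_s / 3600
    print(f"{label}: {hours:.1f} hours per full pass")
# all-flash NVMe: 0.6 hours; hard-drive array: 11.1 hours.
# Slow storage leaves very expensive GPUs sitting idle.
```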

Networking Infrastructure:

  • Nvidia Quantum-2 networking based on NDR InfiniBand running at 400 Gbps.
  • Each Super Micro HGX B200 server has eight Nvidia ConnectX-7 NICs (one per GPU); the aggregate bandwidth is worked out after this list.
  • Fiber connects servers to the scale-out GPU fabric.
  • Includes an Ethernet-based fabric.
  • Each server also has an Nvidia BlueField-3 DPU with 16 cores, 16 GB of memory, and its own OS to offload networking tasks.
  • The DPU drives 400 Gbps of network bandwidth and has built-in accelerators.
  • High-end NICs are essential for serving multiple customers in the cluster, increasing network load.
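
Working out the aggregate bandwidth from the figures the video does state (eight NICs per server, each at NDR 400 Gbps):

```python
# Per-server scale-out bandwidth from figures stated in the video:
# eight ConnectX-7 NICs per server, one per GPU, at NDR 400 Gbps each.
NICS_PER_SERVER = 8
NDR_GBPS = 400

per_server_gbps = NICS_PER_SERVER * NDR_GBPS
print(f"{per_server_gbps} Gbps = {per_server_gbps // 8} GB/s per server")
# 3200 Gbps = 400 GB/s of east-west bandwidth per 8-GPU server.

# For the ~1,500-GPU on-demand upper bound (187 full servers):
servers = 1500 // 8
print(f"~{servers * per_server_gbps / 1000:.0f} Tbps of injection bandwidth")
```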

Super Micro HGX B200 8-GPU Platform Details:

  • One "GPU" provisioned is actually the entire Super Micro server.
  • The server contains eight Nvidia B200 GPUs, CPUs, memory, and networking.
  • The tray holds the GPUs, NVLink switches, and PCIe retimers, cooled by heatsinks (air-cooled in this data center).
  • The host system provides IO expansion, boot drives, local AI storage, and cooling for the CPUs and memory.
  • Two boot SSDs are used for the operating system.
  • North-south network ports are provided by the Nvidia BlueField-3 DPU.
  • Local AI storage is available through SSDs.
  • Fans cool the dual CPUs and system memory.
  • Front VGA and USB ports are accessible on the cold aisle for servicing.
  • The back of the system features fans, six 5250W titanium-level efficiency power supplies (with two power inputs per supply), and a NIC tray.
  • Two standard NIC ports and one management port are provided, along with eight Nvidia ConnectX-7 NICs for InfiniBand.
  • The system weighs approximately 286 lbs (130 kg).
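
Since a provisioned "GPU instance" is really the whole 8-GPU server, a quick check on a freshly provisioned node can confirm what you got. This minimal sketch uses the nvidia-ml-py bindings (the `pynvml` module) and must run on the node itself:

```python
# Sanity-check a freshly provisioned node: count the visible GPUs and
# report their names and memory. Requires nvidia-ml-py
# (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
try:
    count = pynvml.nvmlDeviceGetCount()
    print(f"{count} GPUs visible")  # expect 8 on an HGX B200 node
    for i in range(count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older bindings return bytes
            name = name.decode()
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i}: {name}, {mem.total / 2**30:.0f} GiB")
finally:
    pynvml.nvmlShutdown()
```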

Nvidia GB200 NVL72 Rack System (Comparison):

  • The video briefly discusses the liquid-cooled Nvidia GB200 NVL72 racks.
  • These racks are fully integrated and use more power, requiring different facilities.
  • Lambda also deploys these racks in production clusters (example in Mountain View).
  • The GB200 NVL72 integrates 72 Blackwell GPUs and 36 Grace CPUs.
  • Features three types of networks: BlueField-3 DPUs for north-south traffic, InfiniBand/Ethernet for east-west, and NVLink for the GPU interconnect.
  • Power supplies are located in shelves at the top and bottom of the rack.
  • Liquid cooling is essential for this density and performance (rough rack-power arithmetic follows this list).
  • Racks are built and tested by Super Micro and delivered fully assembled.
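
Rough arithmetic shows why air cooling is off the table for this rack. The per-chip power figures below are assumptions for illustration, not from the video:

```python
# Rough arithmetic on why an NVL72 rack needs liquid cooling.
# Assumed per-chip power (not from the video): ~1.2 kW per Blackwell
# GPU and ~0.5 kW per Grace CPU.
GPU_KW, CPU_KW = 1.2, 0.5
rack_kw = 72 * GPU_KW + 36 * CPU_KW   # GPUs + CPUs only
print(f"~{rack_kw:.0f} kW per rack before NVLink switches, DPUs, and losses")
# ~104 kW in a single rack; typical air-cooled racks handle only a
# small fraction of that, hence the integrated power shelves and
# cold plates.
```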

Lambda's "One-Click" AI Cluster Solution:

  • Provides an easy way for machine learning teams to scale from 16 to thousands of GPUs (a provisioning sketch follows this list).
  • Allows for faster transition from R&D to production.
  • Avoids delays associated with direct sales interactions.
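
Lambda's cloud can also be driven programmatically. The minimal sketch below follows the general shape of Lambda's public Cloud API; the region and instance-type names are hypothetical placeholders, so query the instance-types endpoint for current values:

```python
# Minimal provisioning sketch against Lambda's public Cloud API.
# The endpoint shape follows Lambda's published docs; the region and
# instance-type names below are hypothetical -- list /instance-types
# first to see what is actually offered.
import requests

API_KEY = "YOUR_LAMBDA_API_KEY"
BASE = "https://cloud.lambdalabs.com/api/v1"
headers = {"Authorization": f"Bearer {API_KEY}"}

# See which instance types and regions are currently available.
types = requests.get(f"{BASE}/instance-types", headers=headers).json()
print(types)

# Launch a node (hypothetical B200 type and region names).
resp = requests.post(
    f"{BASE}/instance-operations/launch",
    headers=headers,
    json={
        "region_name": "us-midwest-1",        # hypothetical region id
        "instance_type_name": "gpu_8x_b200",  # hypothetical type name
        "ssh_key_names": ["my-key"],
    },
)
print(resp.json())
```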