Meta’s open AI hardware vision

  • At the Open Compute Project (OCP) Global Summit 2024, we're showcasing our newest open AI hardware designs with the OCP community.
  • These innovations include a new AI platform, cutting-edge open rack designs, and advanced network fabrics and components.
  • By sharing our designs, we hope to encourage collaboration and foster innovation. If you're passionate about building the future of AI, we invite you to engage with us and OCP to help shape the next generation of open hardware for AI.

AI has been at the core of the experiences Meta has delivered to people and businesses for years, including AI modeling innovations to optimize and improve features like Feed and our ads system. As we develop and release new, advanced AI models, we are also driven to advance our infrastructure to support our new and emerging AI workloads.

For example, Llama 3.1 405B, Meta's largest model, is a dense transformer with 405B parameters and a context window of up to 128k tokens. To train a large language model (LLM) of this magnitude, on over 15 trillion tokens, we had to make substantial optimizations to our entire training stack. This effort pushed our infrastructure to operate across more than 16,000 NVIDIA H100 GPUs, making Llama 3.1 405B the first model in the Llama series to be trained at such a massive scale.
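
As a rough illustration of why a cluster of this size is needed, the sketch below applies the widely used ~6·N·D FLOPs approximation for dense transformer training. The peak-throughput and utilization figures are assumptions for illustration, not Meta's published numbers.

```python
# Back-of-the-envelope training-compute estimate for a dense transformer,
# using the common ~6 * N * D FLOPs approximation (N = parameters, D = tokens).
# The H100 peak and utilization figures below are illustrative assumptions.

params = 405e9            # Llama 3.1 405B parameters
tokens = 15e12            # "over 15 trillion" training tokens
total_flops = 6 * params * tokens            # ~3.6e25 FLOPs

h100_peak = 989e12        # assumed H100 dense BF16 peak, ~989 TFLOPS
mfu = 0.40                # assumed model FLOPs utilization
gpus = 16_000

seconds = total_flops / (gpus * h100_peak * mfu)
print(f"total compute: {total_flops:.2e} FLOPs")
print(f"rough wall-clock on {gpus:,} GPUs: {seconds / 86_400:.0f} days")
```

Even under these generous assumptions, a job like this occupies the full 16K-GPU cluster for on the order of two months, which is why optimizations across the whole training stack matter.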

Prior to Llama, our largest AI jobs ran on 128 NVIDIA A100 GPUs. But things have accelerated rapidly. Over the course of 2023, we scaled our training clusters from 1K, to 2K, to 4K, and eventually to 16K GPUs to support our AI workloads. Today, we're training our models on two 24K-GPU clusters.

We don't expect this upward trajectory for AI clusters to slow down any time soon. In fact, we expect the amount of compute needed for AI training to grow significantly from where we are today.

Building AI clusters requires more than just GPUs. Networking and bandwidth play an important role in ensuring the clusters' performance. Our systems consist of a tightly integrated HPC compute system and an isolated high-bandwidth compute network that connects all of our GPUs and domain-specific accelerators. This design is necessary to meet our injection needs and to address the challenges posed by our need for bisection bandwidth.

In the next few years, we anticipate greater injection bandwidth on the order of a terabyte per second, per accelerator, with equal normalized bisection bandwidth. This represents growth of more than an order of magnitude compared with today's networks!
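
For intuition, the sketch below shows what "equal normalized bisection bandwidth" implies at that injection rate. The cluster size and today's per-accelerator figure are assumptions chosen for comparison, not published numbers.

```python
# Illustrative only: a fabric with full (normalized) bisection bandwidth must
# carry the aggregate injection rate of half the accelerators across its
# bisection. Cluster size and "today" injection rate are assumptions.

def bisection_tbps(num_accelerators: int, injection_tbps_each: float) -> float:
    """Aggregate bisection bandwidth for a non-blocking fabric, in TB/s."""
    return (num_accelerators / 2) * injection_tbps_each

today = bisection_tbps(24_000, 0.05)   # assume ~400 Gb/s (0.05 TB/s) per GPU today
future = bisection_tbps(24_000, 1.0)   # ~1 TB/s per accelerator, per the target above

print(f"today:  {today:,.0f} TB/s across the bisection")
print(f"future: {future:,.0f} TB/s (~{future / today:.0f}x)")
```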

To support this growth, we need a high-performance, multi-tier, non-blocking network fabric that can use modern congestion control to behave predictably under heavy load. This will enable us to fully leverage the power of our AI clusters and ensure they continue to perform optimally as we push the boundaries of what's possible with AI.
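
As a concrete, hypothetical example of what a multi-tier non-blocking fabric looks like, the sketch below sizes a simple two-tier leaf/spine topology. The switch radix is an assumption, chosen to resemble a 51.2T ASIC exposed as 64 x 800G ports.

```python
import math

def two_tier_nonblocking(num_endpoints: int, radix: int) -> tuple[int, int]:
    """Size a two-tier non-blocking leaf/spine fabric from fixed-radix switches.

    Each leaf splits its ports evenly: radix/2 down to endpoints and radix/2
    up to spines, so uplink capacity always matches downlink capacity.
    """
    down_per_leaf = radix // 2
    leaves = math.ceil(num_endpoints / down_per_leaf)
    spines = down_per_leaf                 # one uplink from every leaf to every spine
    if leaves > radix:
        raise ValueError("beyond radix**2 / 2 endpoints a third tier is required")
    return leaves, spines

# Assumed radix: a 51.2T ASIC exposed as 64 x 800G ports.
leaves, spines = two_tier_nonblocking(num_endpoints=2_048, radix=64)
print(f"{leaves} leaf + {spines} spine switches for 2,048 accelerators")
```

Past radix²/2 endpoints a third tier is required, which is one reason fabrics at this scale are multi-tier and why congestion control has to remain predictable as tiers are added.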

Scaling AI at this speed requires open hardware solutions. Developing new architectures, network fabrics, and system designs is most efficient and impactful when we can build on principles of openness. By investing in open hardware, we unlock AI's full potential and propel ongoing innovation in the field.

Introducing Catalina: Open Architecture for AI Infra

Catalina front view (left) and rear view (right).

Today, we announced the upcoming release of Catalina, our new high-powered rack designed for AI workloads, to the OCP community. Catalina is based on the full rack-scale solution of the NVIDIA Blackwell platform, with a focus on modularity and flexibility. It is built to support the latest NVIDIA GB200 Grace Blackwell Superchip, ensuring it meets the growing demands of modern AI infrastructure.

The growing power demands of GPUs mean open rack solutions need to support higher power capability. With Catalina, we're introducing the Orv3, a high-power rack (HPR) capable of supporting up to 140kW.

The full solution is liquid cooled and consists of a power shelf that supports a compute tray, switch tray, the Orv3 HPR, the Wedge 400 fabric switch, a management switch, a battery backup unit, and a rack management controller.

We aim for Catalina's modular design to empower others to customize the rack to meet their specific AI workloads while leveraging both existing and emerging industry standards.

The Grand Teton Platform now supports AMD accelerators

In 2022, we announced Grand Teton, our next-generation AI platform (the follow-up to our Zion-EX platform). Grand Teton is designed with compute capacity to support the demands of memory-bandwidth-bound workloads, such as Meta's deep learning recommendation models (DLRMs), as well as compute-bound workloads like content understanding.

Now, we have expanded the Grand Teton platform to support the AMD Instinct MI300X and will be contributing this new version to OCP. Like its predecessors, this new version of Grand Teton features a single monolithic system design with fully integrated power, control, compute, and fabric interfaces. This high level of integration simplifies system deployment, enabling rapid scaling with increased reliability for large-scale AI inference workloads.

In addition to supporting a range of accelerator designs, now including the AMD Instinct MI300X, Grand Teton offers significantly greater compute capacity, allowing faster convergence on a larger set of weights. This is complemented by expanded memory to store and run larger models locally, along with increased network bandwidth to scale up training cluster sizes efficiently.

Open Disaggregated Scheduled Fabric

Developing an open, vendor-agnostic networking backend will play an important role going forward as we continue to push the performance of our AI training clusters. Disaggregating our network allows us to work with vendors from across the industry to design systems that are innovative as well as scalable, flexible, and efficient.

Our new Disaggregated Scheduled Fabric (DSF) for our next-generation AI clusters offers several advantages over our existing switches. By opening up our network fabric we can overcome limitations in scale, component supply options, and power density. DSF is powered by the open OCP-SAI standard and FBOSS, Meta's own network operating system for controlling network switches. It also supports an open and standard Ethernet-based RoCE interface to endpoints and accelerators across several GPUs and NICs from multiple different vendors, including our partners at NVIDIA, Broadcom, and AMD.

In addition to DSF, we have also developed and built new 51T fabric switches based on Broadcom and Cisco ASICs. Lastly, we're sharing our new FBNIC, a new NIC module that contains our first Meta-designed network ASIC. In order to meet the growing needs of our AI

Meta and Microsoft: Driving Open Innovation Together

Meta and Microsoft have a long-standing partnership within OCP, beginning with the development of the Switch Abstraction Interface (SAI) for data centers in 2018. Over the years, we've contributed together to key initiatives such as the Open Accelerator Module (OAM) standard and SSD standardization, showcasing our shared commitment to advancing open innovation.

Our current collaboration focuses on Mount Diablo, a new disaggregated power rack. It's a cutting-edge solution featuring a scalable 400 VDC unit that enhances efficiency and scalability. This innovative design allows more AI accelerators per IT rack, significantly advancing AI infrastructure. We're excited to continue our collaboration through this contribution.

The open future of AI infra

Meta is committed to open source AI. We believe that open source will put the benefits and opportunities of AI into the hands of people all over the world.

AI won't realize its full potential without collaboration. We need open software frameworks to drive model innovation, ensure portability, and promote transparency in AI development. We must also prioritize open and standardized models so we can leverage collective expertise, make AI more accessible, and work toward minimizing biases in our systems.

Just as important, we also need open AI hardware systems. These systems are necessary for delivering the kind of high-performance, cost-effective, and adaptable infrastructure that AI advancement requires.

We encourage anyone who wants to help advance the future of AI hardware systems to engage with the OCP community. By addressing AI's infrastructure needs together, we can unlock the true promise of open AI for everyone.
