Wednesday, March 5, 2025

DeepSeek DeepEP - DeepEP is a communication library tailored for Mixture-of-Experts (MoE) and expert parallelism (EP)

DeepEP is a communication library tailored for Mixture-of-Experts (MoE) and expert parallelism (EP). It provides high-throughput and low-latency all-to-all GPU kernels, also known as MoE dispatch and combine kernels. The library also supports low-precision operations, including FP8.


(FP8 and FP16 are floating-point numerical formats: FP8 uses 8 bits per value and FP16 uses 16, trading precision for faster computation and lower memory-bandwidth usage.)
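To make the dispatch/combine terminology concrete, here is a minimal sketch of the pattern DeepEP accelerates, written with plain torch.distributed all-to-all collectives rather than DeepEP's own kernels. The function and variable names are illustrative, and for brevity it uses top-1 routing with experts sharded evenly across ranks; real MoE layers use top-k routing and fused kernels.

```python
# Conceptual sketch (not DeepEP's API): the MoE dispatch/combine pattern,
# expressed with plain torch.distributed all-to-all collectives.
# Assumes one process per GPU (e.g. launched with torchrun), NCCL backend,
# and CUDA tensors; top-1 routing only, for brevity.
import torch
import torch.distributed as dist

def moe_dispatch_combine(tokens, expert_ids, num_experts, expert_fn):
    """tokens: [n, hidden] on the local GPU; expert_ids: [n] expert index per token."""
    world_size = dist.get_world_size()
    experts_per_rank = num_experts // world_size

    # Dispatch: sort tokens by destination rank, exchange counts, then exchange tokens.
    dest_rank = expert_ids // experts_per_rank
    order = torch.argsort(dest_rank)
    send_counts = torch.bincount(dest_rank, minlength=world_size)
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    send_split = send_counts.tolist()
    recv_split = recv_counts.tolist()
    recv_tokens = tokens.new_empty(sum(recv_split), tokens.shape[1])
    dist.all_to_all_single(recv_tokens, tokens[order].contiguous(),
                           output_split_sizes=recv_split,
                           input_split_sizes=send_split)

    # Local expert computation on the received tokens (expert_fn stands in for
    # routing each received token to the correct local expert).
    expert_out = expert_fn(recv_tokens)

    # Combine: send results back along the reverse route and restore token order.
    combined = tokens.new_empty(tokens.shape)
    dist.all_to_all_single(combined, expert_out,
                           output_split_sizes=send_split,
                           input_split_sizes=recv_split)
    restored = torch.empty_like(combined)
    restored[order] = combined
    return restored
```

DeepEP replaces these generic collectives with kernels tuned for NVLink and RDMA transfers, which is where its throughput and latency gains come from.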

(An NVLink domain is a collection of GPUs connected by NVLink and NVSwitch for high-speed communication. NVLink is an NVIDIA interconnect that transfers data between GPUs, and between GPUs and CPUs, over point-to-point links; NVLink Switch chips join multiple links so GPUs can communicate within a rack and between racks. It is widely used in large-scale GPU clusters and HPC (High Performance Computing). For example, the NVIDIA GB200 NVL72 system uses fifth-generation NVLink to connect 72 Blackwell GPUs in a rack-scale design that lets all 72 GPUs act as a single GPU, enabling 30x faster real-time trillion-parameter inference than the prior generation.)
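As a quick way to see whether the GPUs on a node can talk to each other directly, you can probe peer-to-peer access with PyTorch. This only reports P2P capability, not specifically NVLink; `nvidia-smi topo -m` shows the actual link types.

```python
# Minimal sketch: check direct GPU-to-GPU (peer) access on one node with PyTorch.
# Note: this reports whether P2P access is possible, not whether the link is
# NVLink; `nvidia-smi topo -m` shows the actual interconnect per GPU pair.
import torch

def peer_access_matrix():
    n = torch.cuda.device_count()
    return [[torch.cuda.can_device_access_peer(i, j) if i != j else True
             for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    for row in peer_access_matrix():
        print(row)
```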


(Remote Direct Memory Access (RDMA) is a technology that lets one computer read from or write to another computer's memory directly over the network, without involving either host's CPU, cache, or operating system. In RDMA, a protection domain (PD) is an object that associates memory regions, queue pairs, and memory windows, controlling which host memory the network adapter may access.)

To align with the group-limited gating algorithm proposed in the DeepSeek-V3 paper, DeepEP offers a set of kernels optimized for asymmetric-domain bandwidth forwarding, such as forwarding data from the NVLink domain to the RDMA domain. These kernels deliver high throughput, making them suitable for both training and inference prefilling tasks. They also support controlling the number of SMs (Streaming Multiprocessors) used.
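For illustration, here is a simplified sketch of group-limited top-k routing in the spirit of DeepSeek-V3: experts are split into groups (for example, one group per node), each token first keeps only its best-scoring groups, and then selects its top-k experts inside those groups, which bounds how many NVLink/RDMA domains its dispatch traffic must cross. The function name and the group-scoring rule (maximum over each group) are illustrative simplifications, not DeepEP or DeepSeek-V3 code.

```python
# Sketch of group-limited gating (illustrative): restrict each token's top-k
# experts to a few expert groups so its dispatch traffic touches a bounded
# number of nodes. Group score here is the max over the group's experts,
# which is a simplification of the paper's scoring rule.
import torch

def group_limited_topk(scores, num_groups, topk_groups, topk_experts):
    """scores: [num_tokens, num_experts] routing scores (e.g. gating outputs)."""
    num_tokens, num_experts = scores.shape
    experts_per_group = num_experts // num_groups

    # Score each group by its best expert, then keep only the top groups per token.
    group_scores = scores.view(num_tokens, num_groups, experts_per_group).amax(dim=-1)
    top_groups = group_scores.topk(topk_groups, dim=-1).indices        # [T, topk_groups]

    # Mask out experts that live in the discarded groups.
    group_mask = torch.zeros_like(group_scores).scatter_(1, top_groups, 1.0).bool()
    expert_mask = group_mask.repeat_interleave(experts_per_group, dim=1)  # [T, E]
    masked = scores.masked_fill(~expert_mask, float("-inf"))

    # Ordinary top-k selection, now limited to the chosen groups.
    topk_scores, topk_ids = masked.topk(topk_experts, dim=-1)
    return topk_scores, topk_ids
```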

For latency-sensitive inference decoding, DeepEP includes a set of low-latency kernels that use pure RDMA to minimize delays. The library also introduces a hook-based communication-computation overlapping method that does not occupy any SM (Streaming Multiprocessor) resources.
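The hook idea can be sketched with ordinary asynchronous collectives: the dispatch call returns immediately together with a hook, the caller runs unrelated computation, and invoking the hook later completes the transfer. The sketch below uses torch.distributed (whose NCCL transfers may themselves use SMs), so it only illustrates the control flow; DeepEP gets the zero-SM property from RDMA-based transfers.

```python
# Conceptual sketch of hook-based overlap (illustrative, not DeepEP's API):
# the dispatch call returns immediately along with a hook; the caller runs
# other compute while the data is in flight, then calls the hook when the
# dispatched data is actually needed.
import torch
import torch.distributed as dist

def dispatch_async(send_buf, recv_buf):
    # Start a non-blocking all-to-all; the data is in flight after this returns.
    work = dist.all_to_all_single(recv_buf, send_buf, async_op=True)

    def hook():
        # Finish the communication only when the result is required.
        work.wait()
        return recv_buf

    return hook

# Usage: overlap the in-flight dispatch with unrelated computation.
# hook = dispatch_async(tokens_to_send, recv_buffer)
# other_output = attention_block(hidden_states)   # compute while data moves
# expert_input = hook()                           # now the tokens are needed
```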

https://github.com/deepseek-ai/DeepEP

