Session 2: Overview of Underlying Networking Protocols for AI Deployments
Why AI workloads break traditional networks, and how modern protocols fix them
From Ethernet Limits to Lossless High-Performance Fabrics
1. Introduction: Your GPUs are only as fast as your network
Growth of distributed training
East-west traffic explosion
GPU starvation problem
2. Anatomy of an AI Training Job – Network perspective and requirements
Data parallel training
Gradient synchronization
Collective Communication patterns
Reduce
AllReduce (the main pattern in data-parallel training; see the sketch after this list)
ReduceScatter
AllGather
Broadcast
Point-to-Point Communication pattern
Send / Recv
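These collectives compose: ring AllReduce is exactly ReduceScatter followed by AllGather. Below is a minimal single-process C simulation of that decomposition (the rank count and gradient values are assumptions for illustration; real libraries pipeline chunks and overlap communication with compute):

```c
#include <stdio.h>

#define P 4  /* simulated ranks (GPUs); chunk c = element c of the vector */

int main(void) {
    double buf[P][P];

    /* Each rank starts with the gradient vector (r+1, ..., r+1). */
    for (int r = 0; r < P; r++)
        for (int c = 0; c < P; c++)
            buf[r][c] = r + 1;

    /* Phase 1 -- ReduceScatter: in step s, rank r sends chunk
       (r - s) mod P to its ring neighbor, which adds it into its
       own copy. After P-1 steps, rank r holds the fully reduced
       chunk (r + 1) mod P. */
    for (int s = 0; s < P - 1; s++)
        for (int r = 0; r < P; r++) {
            int chunk = ((r - s) % P + P) % P;
            buf[(r + 1) % P][chunk] += buf[r][chunk];
        }

    /* Phase 2 -- AllGather: same ring, but neighbors overwrite
       instead of reducing, circulating the finished chunks. */
    for (int s = 0; s < P - 1; s++)
        for (int r = 0; r < P; r++) {
            int chunk = ((r + 1 - s) % P + P) % P;
            buf[(r + 1) % P][chunk] = buf[r][chunk];
        }

    /* Every rank now holds the global sum 1+2+...+P = P(P+1)/2
       in every position: AllReduce = ReduceScatter + AllGather. */
    for (int r = 0; r < P; r++) {
        printf("rank %d:", r);
        for (int c = 0; c < P; c++)
            printf(" %g", buf[r][c]);
        printf("\n");
    }
    return 0;
}
```

Each rank moves roughly 2(P-1)/P times its data regardless of P, which is why ring AllReduce scales to large clusters, and also why one congested link stalls the entire collective.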
3. RDMA Fundamentals – Why the CPU becomes a bottleneck without RDMA
Zero-copy → memory efficiency
Kernel bypass → latency
Queue pairs → parallelism
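A minimal sketch of all three ideas using libibverbs (assumptions: a RoCE- or InfiniBand-capable NIC, libibverbs installed, arbitrary buffer and queue sizes, error handling trimmed to one check). Registering memory is what enables zero-copy DMA by the NIC, and the queue pair plus completion queue are driven from user space, so no kernel call sits on the data path:

```c
/* Compile: gcc qp_setup.c -libverbs */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void) {
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs || n == 0) {
        fprintf(stderr, "no RDMA-capable NIC found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Zero-copy: pin and register application memory so the NIC can
       DMA straight into/out of it -- no kernel buffer copies. */
    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
        IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
        IBV_ACCESS_REMOTE_WRITE);

    /* Kernel bypass: work is posted to the QP and completions are
       polled from the CQ entirely in user space. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,  /* reliable connection, as RoCEv2 uses */
    };
    /* Queue pairs: applications create many QPs to get parallel,
       independent send/receive channels on one NIC. */
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    printf("QP %u created, MR lkey 0x%x\n", qp->qp_num, mr->lkey);

    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```

A real transfer also needs the QP moved through its INIT/RTR/RTS states and peer addressing exchanged out of band; this sketch stops at resource setup.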
4. RDMA over Fabrics
InfiniBand
Native RDMA – Designed for HPC
Ethernet trying to behave like InfiniBand
RoCEv1 / RoCEv2
Layer 2 vs Layer 3 implications
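The layering difference in a nutshell, as a simplified C sketch (field widths abbreviated; these structs are illustrative, not wire-accurate definitions): RoCEv1 puts IB transport directly in an Ethernet frame, so it cannot cross an IP router, while RoCEv2 wraps the same transport in UDP/IP, which makes it routable and exposes the IP ECN bits that the congestion-control machinery in section 5 relies on.

```c
#include <stdint.h>

/* RoCEv1: IB transport straight over Ethernet, EtherType 0x8915.
   No IP header -> confined to a single L2 domain. */
struct rocev1_frame {
    uint8_t eth[14];  /* dst MAC, src MAC, EtherType = 0x8915 */
    uint8_t grh[40];  /* IB Global Route Header */
    uint8_t bth[12];  /* IB Base Transport Header (QP number, PSN) */
    /* payload + ICRC follow */
};

/* RoCEv2: IB transport over UDP/IP, UDP destination port 4791.
   The IP header carries DSCP/ECN -> routable and ECN-markable. */
struct rocev2_packet {
    uint8_t eth[14];  /* EtherType = 0x0800 (IPv4) */
    uint8_t ip[20];   /* DSCP + ECN bits live here */
    uint8_t udp[8];   /* dst port = 4791 */
    uint8_t bth[12];  /* same IB transport as above */
    /* payload + ICRC follow */
};

int main(void) { return 0; }
```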
5. The Hard Problem: Lossless Ethernet
Problems: why it breaks
Incast problem
Many-to-many patterns
Microbursts
Failure modes
Head-of-line blocking
Congestion spreading
Deadlocks
Solutions
PFC (Priority Flow Control) → prevents drops
ECN (Explicit Congestion Notification) → signals congestion
DCQCN (Data Center Quantized Congestion Notification) → controls rate
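How the three mechanisms interact in DCQCN, as a rough C sketch of the sender-side (reaction point) rules from the DCQCN paper (Zhu et al., SIGCOMM 2015). The constants are illustrative, and the recovery step below merges what the real algorithm splits into fast recovery, additive increase, and hyper increase:

```c
#include <stdio.h>

#define G    (1.0 / 256)  /* alpha gain, on the order the paper suggests */
#define R_AI 40.0         /* additive increase step, Mbps (assumed) */

static double Rc = 100000.0;  /* current send rate, Mbps (assumed 100G) */
static double Rt = 100000.0;  /* target rate to recover toward */
static double alpha = 1.0;    /* congestion estimate, starts pessimistic */

/* ECN-marked packets make the receiver return a CNP; on CNP arrival
   the sender cuts its rate in proportion to alpha. */
static void on_cnp(void) {
    Rt = Rc;                    /* remember the pre-cut rate */
    Rc = Rc * (1 - alpha / 2);  /* multiplicative decrease */
    alpha = (1 - G) * alpha + G;
}

/* In CNP-free periods the estimate decays and the rate climbs back. */
static void on_quiet_period(void) {
    alpha = (1 - G) * alpha;
    Rt += R_AI;
    Rc = (Rt + Rc) / 2;         /* binary-search back toward Rt */
}

int main(void) {
    on_cnp();
    printf("after CNP:        Rc=%.0f Mbps alpha=%.3f\n", Rc, alpha);
    for (int i = 0; i < 5; i++) on_quiet_period();
    printf("after recovery:   Rc=%.0f Mbps alpha=%.3f\n", Rc, alpha);
    on_cnp();
    printf("after second CNP: Rc=%.0f Mbps alpha=%.3f\n", Rc, alpha);
    return 0;
}
```

PFC remains the backstop underneath: DCQCN slows senders before queues fill, so PFC pause frames (and the head-of-line blocking they cause) stay rare.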
6. NVIDIA AI Networking Stack – Tying everything together
Inside the node
NVLink / NVSwitch → avoid the network whenever possible
Across nodes
NCCL (NVIDIA Collective Communication Library) → optimizes collectives
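What driving NCCL looks like from application code, as a minimal sketch (one process managing all visible GPUs via ncclCommInitAll; the buffer size is arbitrary, and CUDA plus NCCL are assumed installed). NCCL picks the transport itself: NVLink/NVSwitch inside the node, RDMA or sockets across nodes:

```c
/* Compile: nvcc allreduce_demo.c -lnccl  (needs >= 2 GPUs) */
#include <stdio.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define MAXDEV 8

int main(void) {
    int n = 0;
    cudaGetDeviceCount(&n);
    if (n > MAXDEV) n = MAXDEV;

    int devs[MAXDEV];
    ncclComm_t comm[MAXDEV];
    float *grad[MAXDEV];
    cudaStream_t stream[MAXDEV];
    size_t count = 1 << 24;  /* 16M floats of "gradients", ~64 MB */

    for (int i = 0; i < n; i++) {
        devs[i] = i;
        cudaSetDevice(i);
        cudaMalloc((void **)&grad[i], count * sizeof(float));
        cudaStreamCreate(&stream[i]);
    }
    ncclCommInitAll(comm, n, devs);  /* one communicator per GPU */

    /* In-place sum AllReduce, grouped so NCCL launches it as a
       single collective across all GPUs. */
    ncclGroupStart();
    for (int i = 0; i < n; i++)
        ncclAllReduce(grad[i], grad[i], count, ncclFloat, ncclSum,
                      comm[i], stream[i]);
    ncclGroupEnd();

    for (int i = 0; i < n; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(stream[i]);
        ncclCommDestroy(comm[i]);
        cudaFree(grad[i]);
    }
    printf("AllReduce complete on %d GPU(s)\n", n);
    return 0;
}
```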
NVIDIA Spectrum-X – ASIC-level architecture → optimized Ethernet for AI
7. Closing: Business Impact
In AI infrastructure, network misconfiguration is not a minor issue; it translates directly into lost GPU ROI
Poor congestion control → GPU idle time
Packet loss → training retries
Latency spikes → longer training cycles
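A back-of-envelope sense of scale, with every number below assumed purely for illustration:

```c
#include <stdio.h>

/* Illustrative only: cluster size, hourly rate, and idle fraction
   are assumptions, not measurements. */
int main(void) {
    double gpus       = 1024;     /* GPUs in the cluster (assumed) */
    double usd_per_hr = 2.50;     /* $/GPU-hour (assumed) */
    double hours      = 24 * 30;  /* one month */
    double idle_frac  = 0.10;     /* 10% idle from network stalls */

    printf("GPU-hours lost per month: %.0f\n", gpus * hours * idle_frac);
    printf("Cost of that idle time:   $%.0f\n",
           gpus * usd_per_hr * hours * idle_frac);
    return 0;
}
```

With these assumed figures, a 10% network-induced stall burns roughly $184,000 of GPU time per month before a single training run is even delayed.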
Registration
📅 May 7th 2026
⏰ 10:00 AM – 11:30 AM CST
📍 Virtual Room
Speaker
Agustin Ciciliani
Engineer at BVS One