Keeping GPUs Ticking Like Clockwork
Episode 1559

The New Stack Podcast · Clockwork, Suresh Vasudevan, The New Stack, Frederic Lardinois

November 17, 2025 · 27m 8s

Show Notes

Clockwork began with a narrow goal, keeping clocks synchronized across servers, but the company soon realized that its precise latency measurements could reveal deeper data center networking issues. This insight led it to build a hardware-agnostic monitoring and remediation platform capable of automatically routing around faults. Today, Clockwork’s technology is especially valuable for large GPU clusters used in training LLMs, where communication efficiency and reliability are critical. CEO Suresh Vasudevan explains that AI workloads are among the most demanding distributed applications ever, and that Clockwork provides building blocks that improve visibility, performance and fault tolerance. The company’s flagship feature, FleetIQ, can reroute traffic around failing switches, preventing costly interruptions that might otherwise force teams to restart training from hours-old checkpoints. Although Clockwork originated from Stanford research focused on clock synchronization for financial institutions, the team eventually recognized that packet-timing data could underpin powerful network telemetry and dynamic traffic control. By integrating with NVIDIA NCCL, TCP and RDMA libraries, Clockwork can not only measure congestion but also actively manage GPU communication to improve both uptime and training efficiency.

Learn more from The New Stack about the latest in Clockwork: 

Clockwork’s FleetIQ Aims To Fix AI’s Costly Network Bottleneck 

What Happens When 116 Makers Reimagine the Clock? 

Topics

ai network, frederic lardinois, software developer, clockwork, tech podcast, the new stack, ai developer, hardware agnostic monitoring, tech, ai development, network, software engineer, the new stack agents, suresh vasudevan, gpu clusters