Star Attention: Efficient LLM Inference over Long Sequences
Episode 158

Daily Paper Cast

November 28, 2024 | 20m 34s

Show Notes

🤗 Paper Upvotes: 32 | cs.CL, cs.AI, cs.LG

Authors:
Shantanu Acharya, Fei Jia, Boris Ginsburg

Title:
Star Attention: Efficient LLM Inference over Long Sequences

Arxiv:
http://arxiv.org/abs/2411.17116v1

Abstract:
Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.
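The two phases described in the abstract can be illustrated with a small single-head NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: the function names (`attention`, `phase1_blockwise_local`, `phase2_global`) are hypothetical, each "host" is simulated as one list entry, and only query-to-context attention is shown in phase 2. The abstract does not say how per-host results are combined; the sketch assumes the standard log-sum-exp merge of partial softmax results, which reproduces global attention while exchanging only one output vector and one scalar per query token per host.

```python
import numpy as np


def attention(q, k, v, causal=False):
    """Scaled dot-product attention. Returns (output, log-sum-exp of scores)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Token i may only see tokens 0..i (assumes len(q) == len(k)).
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    m = scores.max(axis=-1, keepdims=True)
    w = np.exp(scores - m)
    denom = w.sum(axis=-1, keepdims=True)
    return (w / denom) @ v, (m + np.log(denom))[:, 0]


def phase1_blockwise_local(q_ctx, k_ctx, v_ctx, block_size):
    """Phase 1 (assumed form): each context block attends only within itself.

    Blocks are independent, so each could run on a separate host in parallel.
    """
    outputs = []
    for start in range(0, len(q_ctx), block_size):
        sl = slice(start, start + block_size)
        out, _ = attention(q_ctx[sl], k_ctx[sl], v_ctx[sl], causal=True)
        outputs.append(out)
    return np.concatenate(outputs)


def phase2_global(q, kv_blocks):
    """Phase 2 (assumed form): query tokens attend to every cached block.

    Each host computes attention over its own (k, v) shard; the partial
    results are merged with weights derived from each shard's log-sum-exp.
    """
    outs, lses = zip(*(attention(q, k, v) for k, v in kv_blocks))
    outs = np.stack(outs)                    # (hosts, n_query, d)
    lses = np.stack(lses)                    # (hosts, n_query)
    w = np.exp(lses - lses.max(axis=0))      # numerically stable shard weights
    w = w / w.sum(axis=0)
    return (w[..., None] * outs).sum(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n_ctx, block_size = 4, 8, 4
    q_ctx, k_ctx, v_ctx = (rng.normal(size=(n_ctx, d)) for _ in range(3))
    q_query = rng.normal(size=(2, d))

    ctx_out = phase1_blockwise_local(q_ctx, k_ctx, v_ctx, block_size)
    shards = [(k_ctx[i:i + block_size], v_ctx[i:i + block_size])
              for i in range(0, n_ctx, block_size)]
    merged = phase2_global(q_query, shards)
    full, _ = attention(q_query, k_ctx, v_ctx)
    print("phase 1 output shape:", ctx_out.shape)
    print("phase 2 matches global attention:", np.allclose(merged, full))
```

In this single-layer toy the phase 2 result is exact, because the merge is only a reordering of the softmax computation. Any approximation would come from phase 1, where context tokens never see other blocks, which in a multi-layer model changes the keys and values cached by later layers.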