
#131: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Misreading Chat · Hajime Morrita
April 23, 2024 · 30m 40s
Show Notes
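Morita was soundly defeated by a PyTorch kernel written in CUDA. Please send your comments and feedback via Reddit or our listener mailbox. Reviews and star ratings on iTunes are appreciated too.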
- [2205.14135] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- GitHub – Dao-AILab/flash-attention: Fast and memory-efficient exact attention
- GitHub – NVIDIA/apex: A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
- [2307.08691] FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- [2112.05682] Self-attention Does Not Need $O(n^2)$ Memory
- GitHub – tspeterkim/flash-attention-minimal: Flash Attention in ~100 lines of CUDA (forward pass only)