[Episode 79] VisionZip: Reducing Visual Token Redundancy

Seventy3 · 任雨山

December 18, 2024 · 11m 31s

Show Notes

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep learning alongside AI.

Today's topic:

VisionZip: Longer is Better but Not Necessary in Vision Language Models

Summary

The paper introduces VisionZip, a method that improves the efficiency of vision-language models (VLMs) by reducing redundancy in visual tokens. The authors observe that existing VLMs use excessively long visual token sequences, which leads to high computational cost. VisionZip selects a set of informative tokens, significantly improving inference speed while matching or even exceeding the performance of state-of-the-art methods. The technique applies to a range of tasks, including multi-turn dialogue, and is shown to be effective across multiple VLM architectures. The paper also analyzes the causes of redundancy in visual tokens and highlights the limitations of existing text-based token selection methods.
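To make the idea of "selecting informative tokens" concrete, here is a minimal sketch, not the paper's implementation. It assumes each visual token already has a per-token importance score (how VisionZip computes such scores is not described in this summary) and simply keeps the highest-scoring tokens. The function name `select_informative_tokens` and the `scores` and `keep` parameters are hypothetical, chosen for illustration only.

```python
from typing import List, Sequence


def select_informative_tokens(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep: int,
) -> List[int]:
    """Return the indices of the `keep` highest-scoring tokens, in original order.

    `scores` is an assumed, precomputed importance score per token; this
    sketch does not reproduce the paper's scoring rule.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if keep < 0:
        raise ValueError("keep must be non-negative")
    keep = min(keep, len(tokens))
    # Rank token indices by score, highest first. Python's sort is stable,
    # so tokens with equal scores stay in their original relative order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Re-sort the kept indices so the shortened sequence preserves token order.
    return sorted(ranked[:keep])


if __name__ == "__main__":
    # Five toy 2-dimensional "visual tokens" with made-up importance scores.
    toy_tokens = [[0.1, 0.2], [0.9, 0.8], [0.3, 0.1], [0.7, 0.6], [0.2, 0.4]]
    toy_scores = [0.05, 0.40, 0.10, 0.30, 0.15]
    kept = select_informative_tokens(toy_tokens, toy_scores, keep=2)
    shortened = [toy_tokens[i] for i in kept]
    print(kept, shortened)
```

The shorter sequence is what would then be passed to the language model, which is where the inference speed-up described above would come from: fewer visual tokens means less computation per forward pass.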

Paper link: https://arxiv.org/abs/2412.04467
