PaliGemma 2: Versatile Vision-Language Models for Transfer

December 6, 2024 · 13m 15s

Show Notes

PaliGemma 2 is an upgraded version of PaliGemma, a vision-language model that takes both images and text as input. It pairs a vision encoder, which processes images, with a language model from the Gemma 2 family, which processes text. The models are trained on a range of tasks, including image captioning, answering questions about images, and recognizing text in images. The researchers report that PaliGemma 2 generally outperforms PaliGemma on these tasks, especially when it uses a larger language model or higher-resolution images. PaliGemma 2 also performs well on more specialized tasks, such as recognizing table structure in documents, recognizing molecular structures, and reading musical scores. It can even be fine-tuned to generate reports from X-ray images, which may help doctors interpret them.

https://arxiv.org/pdf/2412.03555
