Stable Reasoning in LLMs: A Novel Evaluation Metric and Benchmark

AI Papers Podcast Daily · AIPPD

December 18, 2024 · 10m 26s

Show Notes

This research paper introduces a new way to evaluate how reliably large language models (LLMs) solve math problems. The researchers built a benchmark called LiveMathBench from difficult problems drawn from contests such as the Chinese National Mathematical Olympiad and the American Mathematics Competition. They also propose a metric called G-Pass@k that measures not only whether an LLM can produce the right answer, but also how consistently it does so across multiple attempts at the same problem. They found that even the strongest LLMs struggled to answer these problems correctly on a consistent basis. The takeaway is that simply making LLMs bigger does not always make them better at math, and that new training approaches are needed for LLMs to solve problems reliably.
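To make the idea of a consistency-aware score concrete, here is a minimal Python sketch. It is not taken from the show notes: it assumes the metric can be read as "the probability that at least a fraction tau of k attempts are correct, when the k attempts are drawn without replacement from n generated answers of which c are correct." The function name `g_pass_at_k` and the parameters `n`, `c`, `k`, and `tau` are illustrative choices, and the paper linked below should be consulted for the exact definition.

```python
from math import ceil, comb


def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Chance that at least ceil(tau * k) of k sampled attempts are correct.

    Assumed setup (hypothetical, for illustration only): the model produced
    n answers to one problem, c of them correct, and k of the n are drawn
    without replacement.
    """
    if not (0 <= c <= n and 1 <= k <= n and 0.0 <= tau <= 1.0):
        raise ValueError("require 0 <= c <= n, 1 <= k <= n, 0 <= tau <= 1")
    # Small epsilon guards against floating-point noise in tau * k;
    # at least one correct attempt is always required.
    need = max(1, ceil(tau * k - 1e-9))
    total = comb(n, k)
    # Hypergeometric tail: count draws containing j >= need correct answers.
    favorable = sum(
        comb(c, j) * comb(n - c, k - j) for j in range(need, min(c, k) + 1)
    )
    return favorable / total


if __name__ == "__main__":
    # 4 generated answers, 2 correct, evaluate on k = 2 attempts.
    print(g_pass_at_k(4, 2, 2, tau=0.5))  # lenient: one correct attempt suffices
    print(g_pass_at_k(4, 2, 2, tau=1.0))  # strict: both attempts must be correct
```

Under these assumptions, a model that answers correctly in 2 of 4 generations scores 5/6 when a single correct attempt out of two is enough, but only 1/6 when both attempts must be correct. That gap is the kind of instability the episode describes.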

https://arxiv.org/pdf/2412.13147