
Stable Reasoning in LLMs: A Novel Evaluation Metric and Benchmark
AI Papers Podcast Daily · AIPPD
Show Notes
This research paper proposes a new way to evaluate how reliably large language models (LLMs) solve math problems. The researchers built a benchmark called LiveMathBench from difficult competition problems, drawing on contests such as the Chinese National Mathematical Olympiad and the American Mathematics Competition. They also introduce a metric called G-Pass@k, which measures not only whether an LLM can reach the right answer but also how consistently it does so across multiple attempts at the same problem. They found that even the best LLMs struggled to answer these tough problems correctly on a consistent basis. The takeaway is that simply making LLMs bigger doesn’t always make them better at math, and that we need new ways to teach LLMs to solve problems reliably.
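To make the idea of scoring consistency concrete, here is a minimal Python sketch. The show notes do not give the formula for G-Pass@k, so this assumes one plausible formalization: the model answers a problem n times, c of those answers are correct, and the score is the probability that at least a fraction tau of k answers drawn at random (without replacement) from those n are correct. The function name `g_pass_at_k`, the parameters `n`, `c`, `k`, and the threshold `tau` are illustrative assumptions, not confirmed details of the paper.

```python
from math import ceil, comb


def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Consistency-aware pass rate for one problem (assumed formulation).

    n   -- total attempts generated for the problem
    c   -- how many of those attempts were correct
    k   -- how many attempts are drawn, without replacement
    tau -- required fraction of correct answers among the k drawn, in (0, 1]
    """
    if not (0 <= c <= n and 1 <= k <= n and 0 < tau <= 1):
        raise ValueError("need 0 <= c <= n, 1 <= k <= n, 0 < tau <= 1")

    # Minimum number of correct answers required among the k drawn.
    need = max(1, ceil(tau * k))

    # Hypergeometric tail: count the draws containing j correct answers,
    # for every j from `need` up to the largest possible value.
    favorable = sum(
        comb(c, j) * comb(n - c, k - j)
        for j in range(need, min(c, k) + 1)
    )
    return favorable / comb(n, k)


if __name__ == "__main__":
    # A model that is right on 2 of 4 attempts, judged on 2 drawn attempts.
    print(g_pass_at_k(n=4, c=2, k=2, tau=0.5))  # at least 1 of 2 correct
    print(g_pass_at_k(n=4, c=2, k=2, tau=1.0))  # both drawn attempts correct
```

Under this assumed definition, the same model scores 5/6 when one correct answer out of two is enough, but only 1/6 when both must be correct. That gap is the kind of instability the episode describes: a model can often reach the right answer without reaching it reliably.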