PLAY PODCASTS
Seventy3

620 episodes — Page 10 of 13

[Episode 165] Safety comparison of DeepSeek-R1 and OpenAI's o3-mini

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: o3-mini vs DeepSeek-R1: Which One is Safer?
Summary: The study assesses the safety of two large language models (LLMs), DeepSeek-R1 and OpenAI's o3-mini, using an automated testing tool called ASTRAL. It explores how these models respond to unsafe prompts across various categories, writing styles, and persuasion techniques. The research indicates that DeepSeek-R1 exhibits significantly more unsafe behaviors than o3-mini, particularly in categories such as financial crime and violence. This suggests DeepSeek-R1 is less aligned with safety standards than both o3-mini and earlier OpenAI models, with potential implications for real-world applications. The researchers also note that OpenAI's policy violation safeguards may have influenced o3-mini's safety results, so further testing is needed once the model is fully released. The work emphasizes the importance of robust safety evaluations for LLMs before widespread deployment.
Original paper: https://arxiv.org/abs/2501.18438

Mar 14, 2025 · 16 min

[Episode 164] CodeMonkeys: a test-time compute method for software engineering

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: CodeMonkeys: Scaling Test-Time Compute for Software Engineering
Summary: The "CodeMonkeys" paper introduces a system that improves large language model (LLM) performance on software engineering tasks by scaling test-time compute. This scaling is achieved by iteratively generating and testing code edits, both serially (more iterations per attempt) and in parallel (multiple attempts simultaneously). The system identifies relevant code context, generates candidate edits with accompanying tests, and selects the best edit through voting and a dedicated selection process. By amortizing the cost of context identification and combining test-based voting with model-based selection, CodeMonkeys achieves competitive results on the SWE-bench Verified dataset. The paper also explores combining edits from multiple sources, demonstrating the effectiveness of its selection method in heterogeneous ensembles. Finally, it examines DeepSeek-V3 as a cheaper alternative to Claude Sonnet 3.5 and analyzes the potential benefits.
Original paper: https://arxiv.org/abs/2501.14723

Mar 13, 2025 · 17 min
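
The CodeMonkeys summary mentions test-based voting followed by model-based selection. A minimal sketch of that two-stage idea might look like the following. The function names, the pass/fail table, and the `judge` callable (standing in for an LLM-based selector) are all hypothetical and are not taken from the paper's code.

```python
def vote_by_tests(edits_pass, keep=2):
    """Rank candidate edits by how many generated tests they pass.

    edits_pass maps a (hypothetical) edit name to a list of booleans,
    one per test. Returns the `keep` best edit names, ties broken by name.
    """
    scores = {name: sum(results) for name, results in edits_pass.items()}
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:keep]


def select_final(finalists, judge):
    """Pick one finalist using `judge`, a stand-in for a model-based selector
    that returns a numeric preference score for each edit name."""
    return max(finalists, key=judge)


# Toy data: three candidate edits evaluated against three tests.
edits = {
    "edit_a": [True, True, False],
    "edit_b": [True, False, False],
    "edit_c": [True, True, True],
}
finalists = vote_by_tests(edits, keep=2)
winner = select_final(finalists, judge=lambda name: {"edit_a": 0.9, "edit_c": 0.4}[name])
```

Voting narrows the pool cheaply, so the more expensive selector only has to compare a few finalists.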

[Episode 163] SLMs with an encoder-decoder architecture

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: Return of the Encoder: Maximizing Parameter Efficiency for SLMs
Summary: This paper challenges the current trend of using decoder-only architectures for language models, particularly for small language models (SLMs). It argues that encoder-decoder architectures offer superior efficiency and performance in resource-constrained environments, especially regarding latency and throughput on edge devices. The researchers introduce a knowledge distillation framework that allows encoder-decoder models to learn from larger decoder-only models while maintaining their architectural advantages. They also demonstrate the benefits of encoder-decoder models in vision-language tasks by integrating a vision encoder. Their findings suggest that, for on-device deployment in particular, architectural choices matter more for building efficient SLMs than simply scaling down large models. They show that encoder-decoder models with knowledge distillation can outperform decoder-only models and reduce latency significantly.
Original paper: https://arxiv.org/abs/2501.16273

Mar 12, 2025 · 16 min
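
The episode on encoder-decoder SLMs refers to knowledge distillation from a larger teacher model. As a rough illustration, the classic temperature-softened distillation loss can be sketched as below. This is the generic textbook formulation, not necessarily the framework used in the paper, and the function names and default temperature are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


matched = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge.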

[Episode 162] ICRL: a general-purpose problem-solving method

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: RL + Transformer = A General-Purpose Problem Solver
Summary: This paper introduces an approach called In-Context Reinforcement Learning (ICRL) that uses a pre-trained transformer model to solve problems, including ones it has not seen before. The model, Llama 3.1 8B, is fine-tuned with reinforcement learning, enabling it to meta-learn and adapt efficiently to new environments. The ICRL-trained transformer can combine learned skills, handle suboptimal training data, and adjust to changing environments, which shows its potential as a general-purpose problem solver. The study assesses its performance in in-distribution and out-of-distribution environments, highlighting its ability to stitch together behaviors from its context and improve its solutions iteratively. The results indicate that ICRL holds promise for developing AI systems with human-like adaptability, and the paper also discusses the ethical implications of autonomous agents. The work further reveals challenges related to exploration and suggests avenues for future research to enhance ICRL-trained transformers.
Original paper: https://arxiv.org/abs/2501.14176

Mar 11, 2025 · 15 min

[Episode 161] VideoWorld: learning complex knowledge from unlabeled video data

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
Summary: The paper introduces VideoWorld, an approach to learning complex knowledge directly from unlabeled video data. It presents a video generation model that, unlike traditional language models, learns rules, reasoning, and planning skills solely from visual input, exemplified through tasks such as video-based Go and robotic control. A key finding is that the representation of visual change is vital for knowledge acquisition, which leads to a Latent Dynamics Model (LDM) for improved efficiency. VideoWorld achieves high proficiency on Video-GoBench and demonstrates effective robotic control, rivaling oracle models. The research points to a new direction for AI learning, emphasizing the potential of visual data as a primary source of knowledge. The supplementary material gives additional details about the implementation and results.
Original paper: https://arxiv.org/abs/2501.09781

Mar 10, 2025 · 16 min

[Episode 160] Lessons from AI red teaming in practice

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: Lessons From Red Teaming 100 Generative AI Products
Summary: This paper explores AI red teaming, a practice for assessing the safety and security of generative AI systems, drawing on Microsoft's experience red teaming more than 100 GenAI products. The authors share their internal threat model ontology and eight lessons learned, highlighting the importance of understanding system capabilities, prioritizing simple attack techniques, and recognizing that red teaming differs from safety benchmarking. Automation with tools such as PyRIT can enhance red teaming, but human expertise remains critical, especially in assessing responsible AI harms. The paper stresses that LLMs amplify existing security risks and introduce new vulnerabilities. Securing AI systems is an ongoing process that involves economic considerations, break-fix cycles, and policy regulation.
Original paper: https://arxiv.org/abs/2501.07238

Mar 9, 2025 · 19 min

[Episode 159] TheAgentCompany: a new benchmark for evaluating AI agents on real workplace tasks

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Summary: TheAgentCompany is a new benchmark for evaluating AI agents on real-world workplace tasks. It simulates a software company environment where agents perform tasks such as web browsing, coding, and communicating with simulated colleagues. The paper assesses the performance of various large language models (LLMs) on these tasks and finds that even the best models struggle to complete most of them autonomously. The authors identify challenges such as social interaction, navigating complex UIs, and the lack of training data for certain professional tasks. The benchmark aims to provide insight into the current capabilities and limitations of AI agents in automating work-related tasks. It also includes a breakdown of TheAgentCompany's employee roster and examples of conversations between agents and simulated colleagues. The paper concludes by discussing the implications of its findings and suggesting directions for future research and benchmark improvements.
Original paper: https://arxiv.org/abs/2412.14161

Mar 8, 2025 · 18 min

[Episode 158] What does CoT look like for image generation?

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
Summary: This research explores enhancing autoregressive image generation with Chain-of-Thought (CoT) reasoning strategies that are commonly applied to language models. The study adapts techniques such as test-time verification and preference alignment to improve image quality and text alignment. The authors introduce a Potential Assessment Reward Model (PARM) and PARM++ to better evaluate and refine image generation steps. PARM adaptively assesses potential during generation, while PARM++ adds a reflection mechanism for self-correction. Experiments show significant improvements over existing methods, including Stable Diffusion, highlighting the potential of CoT reasoning in image generation. The authors also discuss how to adapt these strategies and show the effectiveness of tailored reward models.
Original paper: https://arxiv.org/abs/2501.13926

Mar 7, 2025 · 17 min

[Episode 157] DiffuEraser: video inpainting with Stable Diffusion

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: DiffuEraser: A Diffusion Model for Video Inpainting
Summary: DiffuEraser is a video inpainting model that builds on Stable Diffusion to address limitations in existing methods. Current video inpainting techniques often struggle with blurring and temporal inconsistencies, especially with large masked areas. DiffuEraser improves detail and coherence by incorporating prior information to guide the diffusion process and suppress unwanted artifacts. To improve temporal consistency across extended video sequences, it expands the temporal receptive fields and uses the smoothing properties of the Video Diffusion Model. The model decomposes video inpainting into known pixel propagation, unknown pixel generation, and temporal consistency, and offers a targeted solution for each. DiffuEraser outperforms existing methods by producing more complete and temporally consistent results.
Original paper: https://arxiv.org/abs/2501.10018

Mar 6, 2025 · 21 min

[Episode 156] Mobile-Agent-E: an agent on smartphones

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
Summary: This research introduces Mobile-Agent-E, a mobile assistant designed to handle complex, real-world tasks on smartphones. The system employs a hierarchical multi-agent framework with a self-evolution module, enabling it to learn from past experience and improve its performance over time. Mobile-Agent-E separates high-level planning from low-level action execution, using specialized agents for perception, operation, reflection, and note-taking. A new benchmark, Mobile-Eval-E, is introduced to evaluate the agent's capabilities on challenging multi-app tasks. Experimental results show significant improvements over existing approaches, supporting the effectiveness of the hierarchical design and the self-evolution mechanism. The study also analyzes the impact of self-generated tips and shortcuts, paving the way for more efficient and user-friendly mobile agents.
Original paper: https://arxiv.org/abs/2501.11733

Mar 5, 2025 · 12 min

[Episode 155] IntellAgent: a multi-agent framework

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems
Summary: This paper introduces IntellAgent, an open-source multi-agent framework designed to evaluate conversational AI systems. IntellAgent addresses the shortcomings of traditional methods by automating the creation of diverse, realistic scenarios using policy-driven graph modeling, event generation, and user-agent simulations. The framework uses a policy graph to represent policy relationships and complexities, enabling detailed diagnostics of agent performance. Unlike existing benchmarks, IntellAgent offers fine-grained insight into policy adherence and identifies specific areas for improvement. Experiments show that, despite relying on synthetic data, IntellAgent is a robust alternative for evaluating conversational agents and that its results correlate with existing benchmarks. The system is implemented with LangGraph and provides a way to assess different large language models in complex chatbot environments.
Original paper: https://arxiv.org/abs/2501.11067

Mar 4, 2025 · 12 min

[Episode 154] Agentic RAG survey

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG
Summary: This paper is a survey of Agentic Retrieval-Augmented Generation (RAG), a paradigm that enhances large language models by integrating autonomous AI agents into the RAG pipeline. This allows for dynamic retrieval strategies, contextual understanding, and iterative refinement, addressing the limitations of traditional RAG systems. The survey covers the evolution of RAG paradigms, detailed Agentic RAG architectures, and applications across industries such as healthcare, finance, and education. It also explores implementation strategies, challenges in scaling, ethical considerations, performance optimization, and relevant frameworks and tools. Finally, the survey provides an overview of benchmarks and datasets used to evaluate RAG systems.
Original paper: https://arxiv.org/abs/2501.09136

Mar 3, 2025 · 13 min

[Episode 153] The Chain-of-Agents framework

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: Chain of Agents: Large Language Models Collaborating on Long-Context Tasks
Summary: Large language models (LLMs) struggle with long contexts because of limits on how much information they can process effectively. The "Chain-of-Agents" (CoA) framework addresses this by using multiple LLM agents that collaborate to process long documents. CoA divides the input into segments, assigns each segment to a worker agent, and then uses a manager agent to integrate the information and produce a final output. This method outperforms traditional approaches such as Retrieval-Augmented Generation (RAG) and full-context LLMs, particularly in question answering, summarization, and code completion tasks. CoA also mitigates the problem of losing focus within long contexts, and it is task-agnostic, training-free, and highly interpretable. Overall, the framework improves processing and reasoning over long contexts, expanding the potential applications of LLMs in various domains.
Original paper: https://arxiv.org/abs/2406.02818

Mar 2, 2025 · 14 min
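
The segment, worker, and manager flow described for Chain-of-Agents can be sketched in a few lines. This is only an illustration under assumptions: `worker` and `manager` are hypothetical callables standing in for LLM calls, segments are split by word count rather than by tokens, and the workers are assumed to run in sequence, each handing its notes to the next (the summary above does not spell this out).

```python
def split_into_segments(text, max_words):
    """Split a long document into consecutive segments of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def chain_of_agents(document, question, worker, manager, max_words=200):
    """Run one worker per segment, passing accumulated notes forward,
    then let the manager produce the final answer from those notes."""
    notes = ""
    for segment in split_into_segments(document, max_words):
        notes = worker(segment, question, notes)
    return manager(question, notes)


# Toy stand-ins for the LLM calls: the worker appends its segment to the notes,
# and the manager formats the question together with the collected notes.
def toy_worker(segment, question, notes):
    return (notes + " " + segment).strip()


def toy_manager(question, notes):
    return f"{question}: {notes}"


answer = chain_of_agents("a b c d e", "q", toy_worker, toy_manager, max_words=2)
```

Because each worker only sees one segment plus the running notes, no single call has to fit the whole document into its context window.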

[Episode 152] Kimi k1.5

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: Kimi k1.5: Scaling Reinforcement Learning with LLMs
Summary: This technical report introduces Kimi k1.5, a multimodal large language model trained with reinforcement learning (RL). The report highlights the model's training techniques, including long-context scaling and policy optimization, and emphasizes a simple yet effective RL framework. Kimi k1.5 achieves state-of-the-art reasoning performance across several benchmarks, even outperforming models such as OpenAI's o1 and GPT-4o in certain short-CoT reasoning tasks. A key aspect is the exploration of long-context RL: the model is trained on sequences of up to 128k tokens, and policy optimization uses a variant of online mirror descent for robustness. The report also details long2short methods, infrastructure optimization, and ablation studies, showing Kimi k1.5's advances in multimodal AI capabilities and token efficiency.
Original paper: https://arxiv.org/abs/2501.12599

Mar 1, 2025 · 22 min

[Episode 151] Humanity’s Last Exam

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: Humanity’s Last Exam
Summary: "Humanity's Last Exam" (HLE) is a new benchmark designed to assess the knowledge of large language models (LLMs) at the frontier of human expertise. The dataset contains 3,000 multiple-choice and short-answer questions across various subjects, emphasizing deep reasoning skills and resistance to simple internet retrieval. The questions undergo a rigorous review process by subject-matter experts to ensure difficulty and quality. Evaluations reveal that current LLMs exhibit low accuracy and poor calibration on HLE, indicating a significant gap in capabilities. The authors suggest that HLE offers a reference point for AI progress and informs discussions of AI risks and governance. The dataset was created through a global effort by almost 1,000 expert contributors.
Original paper: https://arxiv.org/abs/2501.14249

Feb 28, 2025 · 11 min

[Episode 150] DeepSeek-R1

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Summary: DeepSeek-AI introduces DeepSeek-R1-Zero and DeepSeek-R1, two reasoning-focused large language models. DeepSeek-R1-Zero uses reinforcement learning (RL) without supervised fine-tuning (SFT) to achieve remarkable reasoning capabilities. DeepSeek-R1 builds on this by incorporating multi-stage training and "cold-start" data before RL, achieving results comparable to OpenAI's models. The company releases DeepSeek-R1-Zero, DeepSeek-R1, and smaller distilled models to support the research community. Experiments show that DeepSeek-R1 excels at reasoning tasks, outperforming other models on certain benchmarks, and that distillation from DeepSeek-R1 greatly improves the reasoning abilities of smaller models. The study explores the benefits of RL and distillation and also discusses approaches that did not succeed, such as Process Reward Models and Monte Carlo Tree Search.
Original paper: https://arxiv.org/abs/2501.12948

Feb 27, 2025 · 15 min

[Episode 149] Mind Evolution: an evolutionary search strategy

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: Evolving Deeper LLM Thinking
Summary: This paper introduces Mind Evolution, an evolutionary search strategy for enhancing the problem-solving capabilities of large language models (LLMs) in natural language planning. The method uses an LLM to generate, combine, and refine candidate solutions iteratively, guided by feedback from an evaluator. Mind Evolution outperforms existing inference strategies by making effective use of inference-time compute without needing a formal problem definition. The paper reports strong results on benchmarks such as TravelPlanner and Natural Plan and introduces a new challenging task called StegPoet. The core innovation is the ability to optimize solutions directly in natural language space, which removes the need for task formalization. Ablation studies confirm the importance of critical conversation and feedback mechanisms within the evolutionary process. The authors show that the approach can achieve high success rates, sometimes exceeding 99%, and point to LLM-based evaluators as a way to broaden the scope of application in the future.
Original paper: https://arxiv.org/abs/2501.09891

Feb 26, 2025 · 30 min
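
The generate, combine, and refine loop guided by an evaluator, as described for Mind Evolution, has the shape of a standard evolutionary search. The sketch below shows that shape on a toy problem. It is an assumption-heavy illustration: in the paper the generation, combination, and refinement steps are performed by an LLM on natural-language plans, whereas here they are plain Python stand-ins operating on lists of bits, and every name is hypothetical.

```python
import random


def evolve(init, crossover, mutate, score, generations=20, population=8, seed=0):
    """Evolve candidates: keep the best half, refill with mutated crossovers."""
    rng = random.Random(seed)
    pool = [init(rng) for _ in range(population)]
    for _ in range(generations):
        pool.sort(key=score, reverse=True)
        parents = pool[: population // 2]  # elitism: the best candidates survive
        children = []
        while len(parents) + len(children) < population:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        pool = parents + children
    return max(pool, key=score)


# Toy stand-ins: a "plan" is a list of 10 bits and the evaluator counts the ones.
def init_plan(rng):
    return [rng.randint(0, 1) for _ in range(10)]


def combine_plans(a, b, rng):
    return [rng.choice(pair) for pair in zip(a, b)]


def refine_plan(plan, rng):
    i = rng.randrange(len(plan))
    return plan[:i] + [1 - plan[i]] + plan[i + 1:]  # flip one bit


def evaluate_plan(plan):
    return sum(plan)


best = evolve(init_plan, combine_plans, refine_plan, evaluate_plan)
```

Because the best half of the pool always survives, the top score never decreases from one generation to the next.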

[Episode 148] Embodied-RAG: giving robots stronger memory and reasoning in complex environments

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation
Summary: This paper introduces Embodied-RAG, a framework designed to equip robots with enhanced memory and reasoning capabilities in complex environments. It tackles the challenges of applying Retrieval-Augmented Generation (RAG) to robotics by constructing a hierarchical semantic forest for efficient knowledge storage and retrieval. Embodied-RAG integrates multimodal data and spatial awareness, outperforming existing RAG methods in navigation and explanation tasks. A new dataset, the Embodied-Experiences Dataset, is introduced to facilitate further research in this area. The core innovation is the system's ability to build and use a hierarchical spatial memory, enabling robots to navigate and communicate more effectively across diverse environments and query types. This work provides a foundation for developing generalist robot agents with language-based non-parametric memories.
Original paper: https://arxiv.org/abs/2409.18313

Feb 25, 2025 · 16 min

[Episode 147] VideoRAG

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: VideoRAG: Retrieval-Augmented Generation over Video Corpus
Summary: VideoRAG is a framework that enhances Retrieval-Augmented Generation (RAG) by incorporating video content. Unlike traditional RAG, which primarily uses text, VideoRAG dynamically retrieves relevant videos and integrates both their visual and textual information to generate more accurate and contextually rich answers. The approach uses Large Video Language Models (LVLMs) to process video content directly and combine it with queries. Experimental results show that VideoRAG outperforms existing RAG baselines, supporting the use of videos as a knowledge source. The study also addresses the problem of missing video subtitles by generating auxiliary text with automatic speech recognition. Finally, an exploration of different modalities and their combinations underscores the importance of both visual and textual features in video-based RAG.
Original paper: https://arxiv.org/abs/2501.05874

Feb 24, 2025 · 13 min

[Episode 146] How to train energy-based models (EBMs)

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: How to Train Your Energy-Based Models
Summary: Energy-Based Models (EBMs) offer a flexible approach to probabilistic modeling by specifying probability densities only up to an unknown normalizing constant, which allows the use of versatile architectures. The challenge is that this normalizing constant is intractable, which makes training difficult. The paper introduces and compares modern EBM training methods, focusing on Maximum Likelihood with Markov Chain Monte Carlo (MCMC), Score Matching (SM), and Noise Contrastive Estimation (NCE). It explains the theoretical connections among these techniques and briefly explores alternative training methodologies. It also highlights the application of these techniques to score-based generative models. Finally, it discusses minimizing differences or derivatives of KL divergences, minimizing the Stein discrepancy, and adversarial training.
Original paper: https://arxiv.org/abs/2101.03288

Feb 23, 2025 · 22 min
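
To make the "unknown normalizing constant" in the EBM episode concrete, the standard formulation can be written out as follows. The symbols (energy function E_θ, partition function Z_θ) follow common convention and are an assumption about notation, not a quotation from the paper.

```latex
% Density defined through an energy function, up to the normalizer Z_\theta
p_\theta(x) = \frac{\exp\bigl(-E_\theta(x)\bigr)}{Z_\theta},
\qquad
Z_\theta = \int \exp\bigl(-E_\theta(x)\bigr)\,dx

% Maximum-likelihood gradient: the second term needs samples from the model,
% which is where MCMC comes in
\nabla_\theta \log p_\theta(x)
  = -\nabla_\theta E_\theta(x)
  + \mathbb{E}_{x' \sim p_\theta}\bigl[\nabla_\theta E_\theta(x')\bigr]

% The score with respect to x does not involve Z_\theta at all,
% which is what score matching exploits
\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)
```

The second line follows from log p_θ(x) = -E_θ(x) - log Z_θ together with ∇_θ log Z_θ = -E_{x'∼p_θ}[∇_θ E_θ(x')].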

[Episode 145] Inference-time scaling for diffusion models

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
Summary: This research explores enhancing diffusion models by scaling inference-time computation beyond simply increasing the number of denoising steps. The authors propose a search framework that identifies better noises for the diffusion sampling process. The framework has two components: verifiers that provide feedback and algorithms that find noise candidates. Experiments on image generation show that increasing inference-time compute through this search framework improves sample quality. The study also analyzes the alignment between verifiers and generation tasks, revealing inherent biases. Overall, the findings show substantial improvements in the samples generated by diffusion models when more compute is combined with a carefully chosen search setup.
Original paper: https://arxiv.org/abs/2501.09732

Feb 22, 2025 · 16 min
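
The search framework in the diffusion episode pairs a verifier with an algorithm for proposing noise candidates. The simplest instance of that idea is random search: sample several starting noises, generate from each, and keep the output the verifier scores highest. The sketch below illustrates only this simplest case, with hypothetical toy stand-ins for the sampler and the verifier. It is not the paper's implementation.

```python
import random


def search_over_noise(sample_noise, generate, verifier, num_candidates, seed=0):
    """Random search: try num_candidates noises and keep the best-scoring output."""
    rng = random.Random(seed)
    best_output, best_score = None, float("-inf")
    for _ in range(num_candidates):
        noise = sample_noise(rng)
        output = generate(noise)   # stand-in for a full denoising run
        score = verifier(output)   # stand-in for a learned or rule-based verifier
        if score > best_score:
            best_output, best_score = output, score
    return best_output, best_score


# Toy stand-ins: scalar "images" and a verifier that prefers outputs near 1.0.
def toy_noise(rng):
    return rng.gauss(0.0, 1.0)


def toy_generate(noise):
    return 2.0 * noise


def toy_verifier(output):
    return -abs(output - 1.0)


_, score_small = search_over_noise(toy_noise, toy_generate, toy_verifier, num_candidates=4)
_, score_large = search_over_noise(toy_noise, toy_generate, toy_verifier, num_candidates=64)
```

With a fixed seed, the larger budget evaluates a superset of the smaller budget's candidates, so its best score can only match or improve on the smaller one.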

[Episode 144] Transformer-Squared: a self-adaptive LLM framework

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: Transformer-Squared: Self-adaptive LLMs
Summary: This research paper introduces Transformer², a self-adaptive large language model (LLM) framework. Transformer² uses Singular Value Fine-tuning (SVF), a parameter-efficient method, to train "expert" vectors for specific tasks using reinforcement learning. During inference, a two-pass mechanism dynamically combines these experts based on the input prompt, significantly improving performance over existing methods such as LoRA. The paper presents three adaptation strategies and demonstrates the effectiveness of Transformer² across various LLMs and tasks, including vision-language models. The authors also explore cross-model compatibility and discuss avenues for future research.
Original paper: https://arxiv.org/abs/2501.06252

Feb 21, 2025 · 23 min

[Episode 143] Building large language model (LLM) agents capable of lifelong learning

Seventy3: turning papers into podcasts with NotebookLM so everyone can keep learning alongside AI.
Today's topic: Lifelong Learning of Large Language Model based Agents: A Roadmap
Summary: This paper surveys techniques for building large language model (LLM) agents capable of lifelong learning. It categorizes key agent components into perception, memory, and action modules, emphasizing how these modules enable continuous adaptation and mitigate catastrophic forgetting. The authors explore strategies for each module, including multimodal perception, diverse memory types (working, episodic, semantic, parametric), and grounding, retrieval, and reasoning actions. The paper also reviews relevant evaluation metrics and discusses real-world applications. Finally, it offers insights into future research directions, focusing on improving the integration and scalability of these modules for more robust and human-like learning.
Original paper: https://arxiv.org/abs/2501.07278

Feb 20, 2025 · 20 min

【第142期】Titans:神经长期记忆模块

Seventy3: 用NotebookLM将论文生成播客,让大家跟着AI一起进步。今天的主题是:Titans: Learning to Memorize at Test TimeSummaryThis research paper introduces Titans, a novel family of neural architectures designed to improve long-term memory in sequence modeling. Titans incorporate a new neural long-term memory module that learns to memorize historical context at test time, addressing the limitations of Transformers and existing recurrent models. The model uses a "surprise" metric to determine what information to remember and a forgetting mechanism to manage memory capacity. Three Titans variants—Memory as a Context, Memory as a Gate, and Memory as a Layer—are presented, showcasing different ways to integrate the long-term memory module. Experimental results across various tasks demonstrate Titans' superior performance and scalability to extremely long contexts.这篇研究论文介绍了Titans,一种新型神经网络架构家族,旨在改善序列建模中的长期记忆。Titans引入了一个新的神经长期记忆模块,能够在测试时学习记住历史上下文,解决了Transformer和现有循环模型的局限性。该模型使用“惊讶”度量来决定记住哪些信息,并采用遗忘机制来管理记忆容量。论文提出了三种Titans变体——“记忆作为上下文”、“记忆作为门控”和“记忆作为层”,展示了集成长期记忆模块的不同方式。跨多个任务的实验结果表明,Titans在处理极长上下文时表现出色,并具有更强的扩展性。原文链接:https://arxiv.org/abs/2501.00663

Feb 19, 2025 · 11 min

【Episode 141】O1 Replication Journey: Part 3

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning
Summary: This research paper investigates the effectiveness of inference-time scaling in large language models (LLMs) for medical reasoning tasks. The authors explore how increasing the processing time during inference improves the accuracy of LLMs on complex medical benchmarks like MedQA and JAMA Clinical Challenges. They introduce a novel journey learning approach, using knowledge distillation to generate high-quality training data for improved reasoning chains. Their experiments show that longer inference times correlate with better performance, especially for more challenging tasks, though sufficient LLM capacity is crucial. The study also examines the utility of majority voting as a means to scale inference-time computations.
Original link: https://arxiv.org/abs/2501.06458
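For listeners unfamiliar with majority voting, here is a minimal sketch of the general idea: sample several answers to the same question and keep the most frequent one. The function name and the sample answers are hypothetical illustrations, not taken from the paper, and the paper's exact voting setup may differ.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Counter.most_common keeps first-seen order among equal counts.
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers from five independent samples
# for a single multiple-choice question.
samples = ["B", "C", "B", "A", "B"]
print(majority_vote(samples))
```

Drawing more samples spends more compute at inference time, which is the sense in which voting acts as a scaling knob.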

Feb 18, 2025 · 18 min

【Episode 140】CNCD: Novel Class Discovery

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: CLIP-guided continual novel class discovery
Summary: This research paper introduces a novel method for Continual Novel Class Discovery (CNCD), a challenging machine learning problem focused on teaching a model new classes without forgetting previously learned ones, especially when old data is unavailable. The proposed method leverages the CLIP model for guidance in identifying new classes and uses techniques like CutMix and prototype adaptation to improve representation learning and prevent forgetting. Experiments on several benchmark datasets demonstrate the method's effectiveness in balancing the learning of both new and old classes. The paper also explores the benefits of decoupling the training process for old and new classes and compares its performance to existing CNCD and novel class discovery methods. The authors conclude by discussing limitations and future directions for improving computational efficiency.
Original link: https://www.sciencedirect.com/science/article/abs/pii/S0950705124015545

Feb 17, 2025 · 17 min

【Episode 139】Evaluating Multilingual Robot Control Capabilities

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: Language and Planning in Robotic Navigation: A Multilingual Evaluation of State-of-the-Art Models
Summary: This research paper evaluates the performance of several multilingual Small Language Models (SLMs) and one Arabic-centric Large Language Model (LLM) on vision-and-language navigation (VLN) tasks. Using the NavGPT framework and a bilingual (English and Arabic) version of the R2R dataset, the study assesses the models' reasoning and planning capabilities in both languages. The findings highlight the importance of robust multilingual models for effective VLN, especially in Arabic-speaking regions where such resources are limited. The study also identifies limitations in current models, including parsing issues and insufficient reasoning abilities, suggesting areas for future development. The quantitative and qualitative analyses compare the models' success rates, navigation errors, and planning strategies across languages.
Original link: https://arxiv.org/abs/2501.05478

Feb 16, 2025 · 12 min

【Episode 138】ParGo: Bridging the Gap Between Vision and Language

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: ParGo: Bridging Vision-Language with Partial and Global Views
Summary: This research introduces ParGo, a novel vision-language projector designed to improve multimodal large language models (MLLMs). ParGo bridges the gap between vision and language by integrating both global and partial views of images, addressing the limitations of previous methods that overemphasize prominent regions. A new dataset, ParGoCap-1M-PT, containing one million detail-captioned images, was created to facilitate ParGo's training. Extensive experiments demonstrate ParGo's superior performance on various MLLM benchmarks, especially in tasks requiring detailed perception. The key innovation is ParGo's ability to leverage both broad and specific image information.
Original link: https://arxiv.org/abs/2408.12928

Feb 15, 2025 · 16 min

【Episode 137】Agents, Sims and Assistants

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: Agents Are Not Enough
Summary: This research paper argues that current AI agents, while experiencing a resurgence, are insufficient for creating truly effective and sustainable AI systems. The authors analyze past failures of various agent architectures, identifying limitations in generalization, scalability, coordination, robustness, and ethical considerations. They propose a new ecosystem incorporating Agents, Sims (user representations), and Assistants to overcome these challenges. This three-part system aims to improve personalization, trust, and value generation, ultimately leading to more successful and widely accepted AI agents. The paper concludes by suggesting the need for standardization to foster a thriving agent-based ecosystem.
Original link: https://www.arxiv.org/abs/2412.16241

Feb 14, 2025 · 30 min

【Episode 136】R3GAN: A Simplified Generative Adversarial Network

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: The GAN is dead; long live the GAN! A Modern GAN Baseline
Summary: This NeurIPS 2024 paper introduces R3GAN, a simplified Generative Adversarial Network (GAN) that achieves state-of-the-art performance. The authors achieve this by developing a novel, mathematically well-behaved loss function that eliminates the need for the ad-hoc training tricks common in previous GANs. This improved loss enables the use of modern neural network architectures, resulting in a more efficient and effective model. R3GAN surpasses existing GANs and diffusion models on several benchmark datasets, demonstrating the effectiveness of the proposed approach. The paper rigorously supports its claims through mathematical analysis and extensive empirical results. The authors also discuss the limitations of their approach and potential societal impacts of GAN technology.
Original link: https://arxiv.org/abs/2501.05441

Feb 13, 2025 · 14 min

【Episode 135】Search-o1: Reasoning in Documents

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: Search-o1: Agentic Search-Enhanced Large Reasoning Models
Summary: The paper introduces Search-o1, a framework that enhances large reasoning models (LRMs) by integrating an agentic search workflow. This allows the LRM to dynamically retrieve external knowledge when it encounters uncertainty during complex reasoning tasks. A key component is the Reason-in-Documents module, which refines retrieved information to maintain coherent reasoning. Experiments across various domains demonstrate Search-o1's superior performance compared to existing methods, even rivaling human experts in certain areas. The framework addresses knowledge insufficiency, a major limitation of current LRMs, improving their reliability and versatility. The code is publicly available.
Original link: https://arxiv.org/abs/2501.05366

Feb 12, 2025 · 10 min

【Episode 134】DPO Kernels: Enhancing Direct Preference Optimization with Kernel Methods

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization
Summary: This research paper introduces DPO-Kernels, an improved method for aligning large language models (LLMs) with human preferences. It enhances Direct Preference Optimization (DPO) by incorporating kernel methods for richer feature transformations and diverse divergence measures for increased robustness. A data-driven approach automatically selects the optimal kernel-divergence pair, eliminating manual tuning. Furthermore, a Hierarchical Mixture of Kernels (HMK) framework combines local and global kernels to balance fine-grained and large-scale dependencies. The paper also explores generalization, overfitting, and ethical considerations related to fairness, bias, and privacy.
Original link: https://arxiv.org/abs/2501.03271

Feb 11, 2025 · 16 min

【Episode 133】Meta-CoT: Towards System 2 Reasoning

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
Summary: The paper "Towards System 2 Reasoning in LLMs" explores methods for improving the reasoning capabilities of large language models (LLMs). It introduces Meta Chain-of-Thought (Meta-CoT), a framework that models the reasoning process itself, going beyond traditional Chain-of-Thought prompting. The authors investigate using search algorithms, synthetic data, and reinforcement learning to train models that generate Meta-CoTs. Empirical results and scaling laws related to inference-time computation and the generator-verifier gap are presented, along with open research questions regarding the emergence of more human-like reasoning in AI. The included example problem-solving attempts illustrate different approaches to this challenge.
Original link: https://arxiv.org/abs/2501.04682

Feb 10, 2025 · 22 min

【Episode 132】Agent Laboratory: A Scientific Research Assistant

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: Agent Laboratory: Using LLM Agents as Research Assistants
Summary: The paper presents Agent Laboratory, an open-source framework that uses large language models (LLMs) to automate the scientific research process. It progresses through literature review, experimentation, and report writing stages, with human researchers providing feedback. Experiments show that Agent Laboratory significantly reduces research costs and that the o1-preview LLM backend produces the best results. The framework also includes a co-pilot mode that enables greater human involvement and improves research quality. However, limitations such as LLM hallucinations and challenges with automated self-evaluation are discussed.
Original link: https://arxiv.org/abs/2501.04227

Feb 9, 2025 · 26 min

【Episode 131】Orient Anything: A Model for Estimating Object Orientation in Images

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models
Summary: The paper introduces Orient Anything, a novel model for estimating object orientation in images. It addresses the challenge of limited labeled data by generating a large dataset of rendered 3D models with precise orientation annotations. The model uses a probability distribution fitting approach for robust orientation prediction, improving accuracy on both rendered and real images. Furthermore, the research demonstrates Orient Anything's superior performance compared to existing methods and its potential applications in spatial reasoning and image generation. Ablation studies validate key design choices, showcasing the model's effectiveness and robustness.
Original link: https://arxiv.org/abs/2412.18605

Feb 8, 2025 · 12 min

【Episode 130】OS-Genesis: Supplying Data for GUI Agents

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
Summary: This research paper introduces OS-Genesis, a novel pipeline for synthesizing high-quality and diverse data for training Graphical User Interface (GUI) agents. Unlike existing methods that rely on pre-defined tasks or human supervision, OS-Genesis uses an interaction-driven approach, allowing agents to explore environments and retrospectively derive tasks. A trajectory reward model ensures data quality, and experiments demonstrate OS-Genesis's superior performance on challenging benchmarks. The authors also analyze data diversity and the impact of the reward model. Finally, they discuss OS-Genesis's limitations and broader implications for digital automation.
Original link: https://arxiv.org/abs/2412.19723

Feb 7, 2025 · 20 min

【Episode 129】Sa2VA: SAM2 + LLaVA

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Summary: The research introduces Sa2VA, a unified model for understanding images and videos. Sa2VA combines the strengths of SAM-2 (video segmentation) and LLaVA (vision-language model) to perform various tasks such as referring segmentation and conversation. A new dataset, Ref-SAV, with complex video scenes, was created to improve model performance. Experiments show Sa2VA achieves state-of-the-art results across multiple benchmarks, particularly in referring video object segmentation. The code, dataset, and models are publicly available.
Original link: https://arxiv.org/abs/2501.04001

Feb 6, 2025 · 14 min

【Episode 128】MeCo: Metadata Conditioning then Cooldown

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: Metadata Conditioning Accelerates Language Model Pre-training
Summary: This research paper introduces Metadata Conditioning then Cooldown (MeCo), a novel method for improving the efficiency and controllability of large language model pre-training. MeCo incorporates readily available metadata, such as URLs, to enhance the model's understanding of diverse data sources during training, then uses a "cooldown" phase to ensure the model functions without metadata during inference. Experiments demonstrate that MeCo significantly accelerates pre-training, achieving comparable performance with less data and enabling better control over model outputs by conditioning inference prompts with metadata. The study explores various metadata types and ablates design choices to understand MeCo's effectiveness, showcasing its potential for creating more capable and steerable language models. Finally, the paper compares MeCo to existing techniques for data selection and metadata conditioning.
Original link: https://arxiv.org/abs/2501.01956

Feb 5, 2025 · 17 min

【Episode 127】Implicit PRM: Process Reward Models

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: Free Process Rewards without Process Labels
Summary: This research paper proposes a cost-effective method for training process reward models (PRMs), which evaluate the intermediate steps of a reasoning process. Unlike existing PRMs that require costly step-level labels, the authors demonstrate that a strong PRM can be implicitly learned at no extra cost by training an outcome reward model (ORM) with a specific reward parameterization. Their method, termed "implicit PRM," outperforms existing baselines on mathematical reasoning tasks while significantly reducing data collection and training overhead. Experiments explore various instantiations of the implicit PRM with different loss functions, showing consistent improvements and data efficiency. The findings suggest a paradigm shift in PRM training approaches, making them more accessible for broader applications.
Original link: https://arxiv.org/abs/2412.01981

Feb 4, 2025 · 15 min

【Episode 126】ICAL: In-Context Abstraction Learning for VLMs

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought
Summary: This NeurIPS 2024 paper introduces In-Context Abstraction Learning (ICAL), a method that allows Vision-Language Models (VLMs) to learn from suboptimal demonstrations and human feedback. ICAL generates its own high-quality examples by abstracting noisy trajectories, correcting errors, and annotating cognitive abstractions like causal relationships and subgoals. The resulting examples significantly improve VLM performance on three benchmarks (TEACh, VisualWebArena, and Ego4D), surpassing state-of-the-art results. The paper also explores the efficiency gains and continual learning capabilities of ICAL, showing reduced reliance on human feedback and environment interactions over time. Furthermore, the impact of fine-tuning the VLM on ICAL's learned examples is evaluated.
Original link: https://arxiv.org/abs/2406.14596

Feb 3, 2025 · 17 min

【Episode 125】GraphAgent: An Automated Agent for Analyzing Structured (Graph) and Unstructured (Text) Data

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: GraphAgent: Agentic Graph Language Assistant
Summary: This paper introduces GraphAgent, a novel automated agent pipeline designed for analyzing both structured (graph) and unstructured (textual) data. GraphAgent uses three key agents: a Graph Generator Agent to create knowledge graphs from text, a Task Planning Agent to interpret user queries, and a Task Execution Agent to perform predictive or generative tasks. The system is evaluated on various datasets, showcasing superior performance compared to state-of-the-art methods in both predictive and generative tasks, particularly with smaller model sizes and zero-shot learning. The authors have made their work open source.
Original link: https://arxiv.org/abs/2412.17029

Feb 2, 2025 · 19 min

【Episode 124】VLA Models for Generalist Robot Control

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models
Summary: This research paper explores the creation of Vision-Language-Action (VLA) models for generalist robot control. The authors investigate key design choices in VLAs, including the selection of Vision-Language Model (VLM) backbones, optimal VLA architectures, and the effective use of cross-embodiment data. Through extensive experimentation, they identify superior VLA structures and backbones, achieving state-of-the-art performance on simulated and real-world robotic tasks. A new framework, RoboVLMs, is introduced to simplify the process of creating VLAs and is made publicly available. The findings highlight the significant advantages of VLMs for generalist robot policies and offer valuable guidance for future VLA development.
Original link: https://arxiv.org/abs/2412.14058

Feb 1, 2025 · 18 min

【Episode 123】Cache-augmented generation (CAG)

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks
Summary: This research paper introduces cache-augmented generation (CAG) as a more efficient alternative to retrieval-augmented generation (RAG) for knowledge-intensive tasks. CAG preloads all relevant knowledge into a large language model (LLM), eliminating the need for real-time retrieval and its associated latency and errors. Experiments using the SQuAD and HotPotQA datasets demonstrate CAG's superior performance and speed, especially when the knowledge base is manageable in size. The authors highlight the advantages of CAG's simplified architecture and improved efficiency, suggesting it as a robust solution for specific applications. The paper concludes by exploring potential hybrid approaches combining preloading with selective retrieval.
Original link: https://arxiv.org/abs/2412.15605

Jan 31, 2025 · 11 min

【Episode 122】HuatuoGPT-o1: A Large Model for Medical Reasoning

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
Summary: This research introduces HuatuoGPT-o1, a large language model (LLM) specialized for complex medical reasoning. The model is trained using a novel two-stage approach: first, a search-based strategy learns complex reasoning trajectories from a newly created dataset of 40,000 verifiable medical problems; second, reinforcement learning (RL) further refines this ability using verifier feedback. HuatuoGPT-o1 significantly outperforms existing general and medical LLMs on various benchmarks, demonstrating the effectiveness of the proposed method. The study also explores the reliability of the LLM-based verifier and investigates the impact of different reasoning strategies and RL algorithms. Finally, the approach is successfully extended to the Chinese medical domain, highlighting its broad applicability.
Original link: https://arxiv.org/abs/2412.18925

Jan 30, 2025 · 20 min

【Episode 121】A Novel Monte Carlo Conformal Prediction Method

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: Conformal prediction under ambiguous ground truth
Summary: This research paper proposes a novel Monte Carlo Conformal Prediction (CP) method to address uncertainty quantification in classification tasks with ambiguous ground truth labels. Standard CP methods often rely on "voted" labels derived from aggregated expert opinions, ignoring inherent label uncertainty. The proposed Monte Carlo CP leverages expert opinions to create a non-degenerate label distribution, generating synthetic pseudo-labels to improve coverage guarantees. The authors demonstrate the method's effectiveness through experiments on skin condition classification, showing improvements over existing CP techniques in handling ambiguous labels. The paper also explores extensions to multi-label classification and robust CP with data augmentation.
Original link: https://arxiv.org/abs/2307.09302

Jan 29, 2025 · 14 min

【Episode 120】iTransformer: Inverted Transformers for Time Series

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: iTransformer: Inverted Transformers Are Effective for Time Series Forecasting
Summary: The paper introduces iTransformer, a novel architecture for time series forecasting that inverts the standard Transformer structure. Instead of embedding multiple variates at each timestamp, iTransformer embeds each time series individually as a token, applying attention to capture multivariate correlations and feed-forward networks to learn series-specific representations. This approach achieves state-of-the-art results on several real-world datasets, showcasing improved performance and generalization compared to existing Transformer-based and linear models, particularly with longer lookback windows. The authors provide extensive experimental results and analysis to support their claims.
Original link: arxiv.org
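To make the "inverted" tokenization concrete, here is a minimal pure-Python sketch contrasting the two ways of cutting a multivariate series into tokens. The sizes, random weights, and helper names are hypothetical illustrations only; this shows the embedding step as described in the summary, not the authors' implementation, and it omits attention and the feed-forward layers.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def standard_tokens(series, w_time):
    """One token per timestamp: mix all variates observed at that step."""
    return matmul(series, w_time)                # (lookback, d_model)

def inverted_tokens(series, w_variate):
    """One token per variate: embed that variate's whole lookback window."""
    return matmul(transpose(series), w_variate)  # (n_variates, d_model)

rng = random.Random(0)
lookback, n_variates, d_model = 96, 7, 16        # hypothetical sizes
series = random_matrix(lookback, n_variates, rng)
w_time = random_matrix(n_variates, d_model, rng)
w_variate = random_matrix(lookback, d_model, rng)

std = standard_tokens(series, w_time)
inv = inverted_tokens(series, w_variate)
print(len(std), len(std[0]))   # 96 16
print(len(inv), len(inv[0]))   # 7 16
```

With the inverted layout, attention operates over 7 variate tokens rather than 96 time-step tokens, which is why it can model correlations between variates directly.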

Jan 28, 2025 · 9 min

【Episode 119】DRT-o1: A Novel Neural Machine Translation Model for Sentences with Similes and Metaphors

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought
Summary: The paper introduces DRT-o1, a novel neural machine translation model designed to improve the translation of sentences containing similes and metaphors. DRT-o1 leverages a multi-agent framework to simulate extended reasoning during the translation process, generating a long chain of thought. This framework comprises a translator, an advisor, and an evaluator, which iteratively refine the translation. The resulting data is then used to fine-tune the model, achieving significant improvements in BLEU, CometKiwi, and CometScore compared to baseline models. The model's effectiveness is demonstrated through experiments on literature translation.
Original link: https://arxiv.org/abs/2412.17498

Jan 27, 2025 · 12 min

【Episode 118】Mulberry: An o1-like Multimodal LLM Built with CoMCTS

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
Summary: This research paper introduces Mulberry, a series of multimodal large language models (MLLMs) designed for improved reasoning and reflection. The key innovation is CoMCTS, a novel Collective Monte Carlo Tree Search method that leverages multiple models to collaboratively identify effective reasoning paths. CoMCTS generates the Mulberry-260k dataset, featuring richly annotated reasoning trees for diverse multimodal questions. Extensive experiments demonstrate Mulberry's superior performance on various benchmarks compared to existing MLLMs. The paper concludes by highlighting CoMCTS and Mulberry-260k as valuable resources for future research in MLLM reasoning.
Original link: https://arxiv.org/abs/2412.18319

Jan 26, 2025 · 11 min

【Episode 117】ExploreToM: Generating Complex and Diverse Theory-of-Mind Data

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: Explore Theory of Mind: Program-guided adversarial data generation for theory of mind reasoning
Summary: The paper introduces ExploreToM, a novel framework for generating complex and diverse theory-of-mind (ToM) datasets for evaluating and training large language models (LLMs). ExploreToM uses an A* search algorithm and a domain-specific language to create challenging story scenarios, revealing significant weaknesses in current LLMs' ToM abilities. The generated data, available online, demonstrates that state-of-the-art LLMs struggle with fundamental skills like state tracking and show surprisingly low accuracy on the generated tasks. Fine-tuning LLMs on ExploreToM data significantly improves their performance on existing ToM benchmarks, highlighting the framework's utility for advancing ToM research. The authors also explore the underlying reasons for LLMs' poor ToM performance, pointing to data biases and the need for targeted training.
Original link: https://arxiv.org/abs/2412.12175

Jan 25, 2025 · 25 min

【Episode 116】A Survey of LLM Inference-Time Self-Improvement

Seventy3: Using NotebookLM to turn papers into podcasts, so everyone can keep up with AI. Today's topic: A Survey on LLM Inference-Time Self-Improvement
Summary: This survey explores Large Language Model (LLM) Inference-Time Self-Improvement (ITSI): techniques that enhance LLM performance at inference without retraining. The authors categorize ITSI methods into three groups: Independent, which improve decoding processes; Context-Aware, which leverage external context or data; and Model-Aided, which use other models for collaboration. A comprehensive taxonomy of existing ITSI methods is presented, along with a discussion of challenges and future research directions, such as addressing biases and improving efficiency. The survey draws on recent publications from top AI conferences. Finally, ethical considerations, including bias and economic and environmental impact, are highlighted.
Original link: https://arxiv.org/abs/2412.14352

Jan 24, 2025 · 17 min