Daily Paper Cast

1,918 episodes — Page 4 of 39

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Apr 16, 2026 | 25 min

Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization

Apr 16, 2026 | 21 min

SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

Apr 16, 2026 | 22 min

Toward Autonomous Long-Horizon Engineering for ML Research

Apr 16, 2026 | 24 min

BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation

Apr 16, 2026 | 21 min

QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation

Apr 15, 2026 | 24 min

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

Apr 15, 2026 | 21 min

OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

Apr 15, 2026 | 21 min

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

Apr 15, 2026 | 21 min

Strips as Tokens: Artist Mesh Generation with Native UV Segmentation

Apr 15, 2026 | 21 min

Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

Apr 15, 2026 | 22 min

Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

Apr 15, 2026 | 23 min

CocoaBench: Evaluating Unified Digital Agents in the Wild

Apr 15, 2026 | 22 min

CodeTracer: Towards Traceable Agent States

Apr 15, 2026 | 23 min

WildDet3D: Scaling Promptable 3D Detection in the Wild

Apr 14, 2026 | 25 min

FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios

Apr 14, 2026 | 21 min

EXAONE 4.5 Technical Report

Apr 14, 2026 | 23 min

RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

Apr 14, 2026 | 22 min

Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

Apr 14, 2026 | 23 min

Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

Apr 11, 2026 | 24 min

SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

Apr 11, 2026 | 22 min

RAGEN-2: Reasoning Collapse in Agentic RL

Apr 10, 2026 | 25 min

MARS: Enabling Autoregressive Models Multi-Token Generation

Apr 10, 2026 | 23 min

Combee: Scaling Prompt Learning for Self-Improving Language Model Agents

Apr 10, 2026 | 21 min

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Apr 9, 2026 | 24 min

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Apr 9, 2026 | 22 min

Learning to Retrieve from Agent Trajectories

Apr 9, 2026 | 22 min

ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation

Apr 9, 2026 | 24 min

GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

Apr 9, 2026 | 23 min

Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning

Apr 9, 2026 | 21 min

ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

Apr 9, 2026 | 22 min

Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision

Apr 9, 2026 | 24 min

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Apr 9, 2026 | 25 min

Watch Before You Answer: Learning from Visually Grounded Post-Training

Apr 9, 2026 | 20 min

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

Apr 8, 2026 | 23 min

MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

Apr 8, 2026 | 23 min

LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models

Apr 8, 2026 | 22 min

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Apr 8, 2026 | 21 min

Adam's Law: Textual Frequency Law on Large Language Models

Apr 8, 2026 | 22 min

AURA: Always-On Understanding and Real-Time Assistance via Video Streams

Apr 8, 2026 | 23 min

ClawArena: Benchmarking AI Agents in Evolving Information Environments

Apr 8, 2026 | 21 min

SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

Apr 8, 2026 | 22 min

LightThinker++: From Reasoning Compression to Memory Management

Apr 8, 2026 | 20 min

Self-Distilled RLVR

Apr 7, 2026 | 21 min

A Simple Baseline for Streaming Video Understanding

Apr 7, 2026 | 21 min

Token Warping Helps MLLMs Look from Nearby Viewpoints

Apr 7, 2026 | 20 min

Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?

Apr 7, 2026 | 22 min

Ep 1721 | DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

🤗 Upvotes: 144 | cs.LG, cs.CL

Authors: Hao Liang, Zhengyang Zhao, Meiyi Qiang, Mingrui Chen, Lu Ma, Rongyi Yu, Hengyi Feng, Shixuan Sun, Zimo Meng, Xiaochen Ma, Xuanlin Yang, Qifeng Cai, Ruichuan An, Bohan Zeng, Zhen Hao Wong, Chengyu Shen, Runming He, Zhaoyang Han, Yaowei Zheng, Fangcheng Fu, Conghui He, Bin Cui, Zhiyu Li, Weinan E, Wentao Zhang

Title: DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

Arxiv: http://arxiv.org/abs/2603.26164v1

Abstract: Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms of dynamic data optimization: sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and unifies key model-dependent operations such as embedding extraction, inference, and gradient computation, with support for large-scale settings including DeepSpeed ZeRO-3. We conduct comprehensive experiments across multiple data-centric methods. Dynamic data selection consistently outperforms static full-data training on MMLU across both Mistral-7B and Llama-3.2-3B. For data mixture, DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity over default proportions when pretraining Qwen2.5-1.5B on SlimPajama at 6B and 30B token scales. DataFlex also achieves consistent runtime improvements over original implementations. These results demonstrate that DataFlex provides an effective, efficient, and reproducible infrastructure for data-centric dynamic training of LLMs.

Apr 4, 2026 | 27 min

Ep 1720 | The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook

🤗 Upvotes: 102 | cs.AI

Authors: Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, Guibin Zhang, Jiale Tao, Jiayi Zhang, Siyuan Ma, Kaituo Feng, Haojie Huang, Youxing Li, Ronghao Chen, Huacan Wang, Chenglin Wu, Zikun Su, Xiaogang Xu, Kelu Yao, Kun Wang, Chen Gao, Yue Liao, Ruqi Huang, Tao Jin, Cheng Tan, Jiangning Zhang, Wenqi Ren, Yanwei Fu, Yong Liu, Yu Wang, Xiangyu Yue, Yu-Gang Jiang, Shuicheng Yan

Title: The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook

Arxiv: http://arxiv.org/abs/2604.02029v1

Abstract: Latent space is rapidly emerging as a native substrate for language-based models. While modern systems are still commonly understood through explicit token-level generation, an increasing body of work shows that many critical internal processes are more naturally carried out in continuous latent space than in human-readable verbal traces. This shift is driven by the structural limitations of explicit-space computation, including linguistic redundancy, discretization bottlenecks, sequential inefficiency, and semantic loss. This survey aims to provide a unified and up-to-date landscape of latent space in language-based models. We organize the survey into five sequential perspectives: Foundation, Evolution, Mechanism, Ability, and Outlook. We begin by delineating the scope of latent space, distinguishing it from explicit or verbal space and from the latent spaces commonly studied in generative visual models. We then trace the field's evolution from early exploratory efforts to the current large-scale expansion. To organize the technical landscape, we examine existing work through the complementary lenses of mechanism and ability. From the perspective of Mechanism, we identify four major lines of development: Architecture, Representation, Computation, and Optimization. From the perspective of Ability, we show how latent space supports a broad capability spectrum spanning Reasoning, Planning, Modeling, Perception, Memory, Collaboration, and Embodiment. Beyond consolidation, we discuss the key open challenges, and outline promising directions for future research. We hope this survey serves not only as a reference for existing work, but also as a foundation for understanding latent space as a general computational and systems paradigm for next-generation intelligence.

Apr 4, 2026 | 22 min

Ep 1719 | Generative World Renderer

🤗 Upvotes: 76 | cs.CV

Authors: Zheng-Hui Huang, Zhixiang Wang, Jiaming Tan, Ruihan Yu, Yidan Zhang, Bo Zheng, Yu-Lun Liu, Yung-Yu Chuang, Kaipeng Zhang

Title: Generative World Renderer

Arxiv: http://arxiv.org/abs/2604.02329v1

Abstract: Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extracted 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes, visual effects, and environments, including adverse weather and motion-blur variants. This dataset uniquely advances bidirectional rendering: enabling robust in-the-wild geometry and material decomposition, and facilitating high-fidelity G-buffer-guided video generation. Furthermore, to evaluate the real-world performance of inverse rendering without ground truth, we propose a novel VLM-based assessment protocol measuring semantic, spatial, and temporal consistency. Experiments demonstrate that inverse renderers fine-tuned on our data achieve superior cross-dataset generalization and controllable generation, while our VLM evaluation strongly correlates with human judgment. Combined with our toolkit, our forward renderer enables users to edit styles of AAA games from G-buffers using text prompts.

Apr 4, 2026 | 22 min